Sam Horradarn
sirahd's activity
How can we find the chunk content using chunk hash?
Chunks are produced via content-defined chunking (CDC) and each chunk hash is computed from the chunk's content, which means that if two chunks have the same content they will share the same hash. This removes the need to store a separate mapping of chunk hash -> chunk content, because we know that if two chunks share the same hash, they have identical content.
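As a rough illustration of the "same content, same hash" property, here is a minimal Python sketch. It assumes SHA-256 as a stand-in hash and takes the chunks as already split; `chunk_hash` and `count_unique_chunks` are hypothetical names for illustration, not our actual implementation.

```python
import hashlib


def chunk_hash(chunk: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever hash the real system uses.
    return hashlib.sha256(chunk).hexdigest()


def count_unique_chunks(chunks: list) -> int:
    # Chunks with identical bytes produce identical hashes, so a set of
    # hashes is enough to detect duplicates without comparing contents.
    seen = set()
    for chunk in chunks:
        seen.add(chunk_hash(chunk))
    return len(seen)


# Hypothetical chunks; the first and third have the same content.
chunks = [b"hello ", b"world", b"hello "]
unique = count_unique_chunks(chunks)
```

Here `unique` counts two distinct chunks out of three, because the repeated chunk hashes to the same value.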
The CAS system only stores "block_hash -> block_content". Where is the chunk-to-block map stored?
This is explained in the "key chunks" section of the blog post above. Essentially, we only store a tiny subset of the chunk -> block mappings by leveraging spatial locality in the file. Trying to store every chunk -> block mapping can become impractical very quickly.
What do the shards store? Is it "file_name, shard_id, chunk_hash, block_hash"?
You can think of the shards as storing mappings from a file (identified by its file hash) to the list of chunks that make up the file.
I hope this helps explain our underlying tech better!
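To make that mental model concrete, here is a small Python sketch of a shard as a file hash -> ordered chunk hash list mapping. Everything here is an assumption for illustration: the `Shard` class, the `add_file` method, the use of SHA-256, and hashing the whole file content to get the file hash are hypothetical, and the real shard format stores more than this.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the real hash function.
    return hashlib.sha256(data).hexdigest()


@dataclass
class Shard:
    # file hash -> ordered list of chunk hashes that make up the file
    files: dict = field(default_factory=dict)

    def add_file(self, chunks: list) -> str:
        # Assumption: the file hash is derived from the full file content.
        file_hash = content_hash(b"".join(chunks))
        self.files[file_hash] = [content_hash(c) for c in chunks]
        return file_hash


shard = Shard()
# Hypothetical chunks of one file, already split by the chunker.
file_chunks = [b"part one, ", b"part two, ", b"part three"]
file_hash = shard.add_file(file_chunks)
chunk_list = shard.files[file_hash]
```

Looking up `file_hash` in the shard returns the chunk hashes in file order, which is what is needed to reassemble the file from its chunks.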