sirahd (Sam Horradarn)

updated a model 4 days ago

sirahd/test-xet-migration-2

Updated 4 days ago

published a model 4 days ago

sirahd/test-xet-migration-2

Updated 4 days ago

commented on From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub 5 days ago

How can we find the chunk content using chunk hash?

Chunk boundaries are chosen via content-defined chunking (CDC), and each chunk's hash is computed from its content, which means that if two chunks have the same content they will share the same hash. This removes the need to store a mapping of chunk hash -> chunk content, because we know that if two chunks share the same hash, they have identical content.
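To make the idea concrete, here is a minimal toy sketch of CDC followed by per-chunk hashing. It is not the Hub's actual implementation: the gear-style rolling hash, the `MASK` and `MIN_CHUNK` values, the use of SHA-256, and the names `cdc_chunks` and `chunk_hash` are all assumptions chosen for illustration.

```python
import hashlib
import random

MASK = (1 << 6) - 1   # toy value: a boundary roughly every 64 bytes on average
MIN_CHUNK = 16        # toy value: avoid very small chunks

# Deterministic per-byte table for a gear-style rolling hash (toy values).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at boundaries decided by the content itself, not by offsets."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if i + 1 - start >= MIN_CHUNK and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a boundary hit
    return chunks


def chunk_hash(chunk: bytes) -> str:
    """Hash derived only from the chunk's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()
```

Because `chunk_hash` depends only on the bytes of the chunk, the same content always produces the same hash, no matter which file or offset it came from. That property is what allows identical chunks to be deduplicated.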

The CAS system only stores "block_hash -> block_content". Where is the chunk-to-block mapping stored?

This is explained in the "key chunks" section of the blog post above. Essentially, we store only a tiny subset of the chunk -> block mappings by leveraging spatial locality in the file. Trying to store every chunk -> block mapping becomes impractical very quickly.
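A rough sketch of how a sparse chunk -> block index could work is shown below. Everything in it is hypothetical: the selection rule in `is_key_chunk` (hash modulo `KEY_EVERY`), the way the block hash is derived, and the `SparseIndex` class are illustrative assumptions, not the rule or data structures described in the blog post.

```python
import hashlib

KEY_EVERY = 4  # toy ratio; a real system would index a far smaller fraction


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def is_key_chunk(chunk_hash: str) -> bool:
    # Hypothetical rule: only hashes divisible by KEY_EVERY are indexed.
    return int(chunk_hash, 16) % KEY_EVERY == 0


class SparseIndex:
    def __init__(self) -> None:
        self.blocks: dict[str, list[str]] = {}  # block_hash -> ordered chunk hashes
        self.key_index: dict[str, str] = {}     # key chunk hash -> block_hash

    def add_block(self, chunks: list[bytes]) -> str:
        hashes = [h(c) for c in chunks]
        block_hash = h("".join(hashes).encode())  # assumed derivation
        self.blocks[block_hash] = hashes
        for ch in hashes:
            if is_key_chunk(ch):
                self.key_index.setdefault(ch, block_hash)
        return block_hash

    def find_known(self, chunks: list[bytes]) -> dict[str, str]:
        """Map chunk_hash -> block_hash for chunks located via key-chunk hits."""
        hashes = [h(c) for c in chunks]
        found: dict[str, str] = {}
        for ch in hashes:
            block_hash = self.key_index.get(ch)
            if block_hash is None:
                continue
            # A key-chunk hit points at a block; neighbouring chunks of the
            # same file are likely stored in that block too (spatial locality).
            members = set(self.blocks[block_hash])
            for other in hashes:
                if other in members:
                    found[other] = block_hash
        return found
```

The index holds entries only for key chunks, yet a single hit is enough to locate the non-key chunks stored alongside it in the same block.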

What do the shards store? Is it "file_name, shard_id, chunk_hash, block_hash"?

You can think of the shards as storing mappings from a file (identified by its file hash) to the list of chunks that make up the file.
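As a mental model only, a shard can be pictured as a dictionary from file hash to an ordered list of chunk hashes. In the sketch below, the in-memory `shard` dict, the helpers `register_file` and `shared_chunks`, and the way the file hash is derived from the chunk hashes are all assumptions for illustration; the real shard format is not described here.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def register_file(shard: dict[str, list[str]], chunks: list[bytes]) -> str:
    """Record file_hash -> ordered chunk hashes and return the file hash."""
    chunk_hashes = [sha(c) for c in chunks]
    # Assumption: derive the file hash from its chunk hashes.
    file_hash = sha("".join(chunk_hashes).encode())
    shard[file_hash] = chunk_hashes
    return file_hash


def shared_chunks(shard: dict[str, list[str]], file_a: str, file_b: str) -> set[str]:
    """Chunk hashes that two files have in common (candidates for dedup)."""
    return set(shard[file_a]) & set(shard[file_b])


shard: dict[str, list[str]] = {}
v1 = register_file(shard, [b"header", b"weights-part-1", b"footer"])
v2 = register_file(shard, [b"header", b"weights-part-2", b"footer"])
print(len(shared_chunks(shard, v1, v2)))  # 2
```

Two versions of a file that differ in one chunk end up with chunk lists that overlap everywhere else, so only the changed chunk needs to be transferred.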

I hope this helps explain our underlying tech better!

upvoted an article 5 days ago

Article

From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub

12 days ago

• 49

updated a dataset 4 months ago

sirahd/test

Viewer • Updated Oct 17, 2024 • 14 • 9
