---
viewer: false
license:
- apache-2.0
language:
- en
---

**Model Summary**

To make GneissWeb reproducible, we provide a [Bloom filter](https://dl.acm.org/doi/10.1145/362686.362692) representing the document ids of all FineWeb 1.1.0 documents that are part of GneissWeb. The filter is 28 GB in size and belongs to the [rbloom](https://github.com/KenanHanke/rbloom) family of Bloom filters. It is meant to be probed with the `id` column of FineWeb 1.1.0 or of Common Crawl. Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) page for more details.

- **Developers**: IBM Research
- **Release Date**: Feb 21st, 2025
- **License**: Apache 2.0

**Testing**

The Bloom filter was tested with:

- **Positive examples**: ~10M UUIDs from 192 parquet files in GneissWeb. These span all 96 snapshots.
- **Negative examples**: 10,000 UUIDs in CC-MAIN-2024-51 (present neither in FineWeb 1.1.0 nor in GneissWeb).

The Bloom filter returned the correct answer for every one of these examples.
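
**Example (sketch)**

Probing the filter with document ids, as described above, might look like the following minimal sketch. It assumes the filter was saved with rbloom's `save` method and built with the SHA-256 based hash pattern shown in the rbloom README; this card does not state which hash function was actually used, so `bloom_hash` is an assumption and must be replaced if it does not match. The helper names `load_filter` and `keep_members` are hypothetical.

```python
from hashlib import sha256
from pickle import dumps
from typing import Container, Iterable, List


def bloom_hash(obj) -> int:
    """Deterministic hash for rbloom.

    ASSUMPTION: this follows the pattern in the rbloom README. It must be
    the same function that was used when the filter was built, otherwise
    membership answers are meaningless.
    """
    digest = sha256(dumps(obj)).digest()
    # rbloom expects a signed 128-bit integer.
    return int.from_bytes(digest[:16], "big") - 2**127


def load_filter(path: str):
    """Load a saved rbloom filter from a local file (hypothetical helper)."""
    from rbloom import Bloom  # imported lazily; requires `pip install rbloom`

    return Bloom.load(path, bloom_hash)


def keep_members(ids: Iterable[str], bloom: Container[str]) -> List[str]:
    """Return the ids that the filter reports as members, in input order.

    A Bloom filter never misses an id that was added, but it can
    occasionally report an id that was not (a false positive).
    """
    return [doc_id for doc_id in ids if doc_id in bloom]
```

With the filter downloaded to a local path of your choosing, a call such as `keep_members(fineweb_ids, load_filter(local_path))` would keep only the ids reported as part of GneissWeb, where `fineweb_ids` holds values from the `id` column of a FineWeb 1.1.0 parquet file. Loading reads the whole 28 GB filter, so the machine needs enough memory for it.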