viewer: false
license:
- apache-2.0
language:
- en
Model Summary
To make GneissWeb reproducible, we provide a Bloom filter that represents the ids of all FineWeb 1.1.0 documents that are part of GneissWeb. The filter is 28 GB in size and belongs to the rbloom family of Bloom filters. It is meant to be probed with the id column of FineWeb 1.1.0 or of Common Crawl.
Please refer to the GneissWeb page for more details.
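As a rough illustration of what "probing with the id column" means, the sketch below filters rows by asking a Bloom filter whether each id is present. It is a minimal sketch under stated assumptions, not the released filter: `ToyBloom` is a tiny pure-Python stand-in written for this example, `keep_rows_in_filter` is a hypothetical helper, and the ids are made up. The real 28 GB filter would be loaded with the rbloom package instead, using whatever hash function it was built with, which is not specified here.

```python
import hashlib


class ToyBloom:
    """Tiny pure-Python Bloom filter; a stand-in for the real 28 GB filter."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive the bit positions from two 64-bit halves of a SHA-256
        # digest (double hashing). This scheme is an assumption of the
        # sketch, not the scheme used by the released filter.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def keep_rows_in_filter(rows, bloom, id_column="id"):
    """Yield only the rows whose id the filter reports as present."""
    for row in rows:
        if row[id_column] in bloom:
            yield row


# Hypothetical ids, made up for illustration only.
gneissweb_ids = ["uuid-aaaa", "uuid-bbbb"]
bloom = ToyBloom(num_bits=1 << 16, num_hashes=7)
for doc_id in gneissweb_ids:
    bloom.add(doc_id)

fineweb_rows = [
    {"id": "uuid-aaaa", "text": "part of the subset"},
    {"id": "uuid-cccc", "text": "not part of the subset"},
    {"id": "uuid-bbbb", "text": "part of the subset"},
]
kept = list(keep_rows_in_filter(fineweb_rows, bloom))
```

A Bloom filter never reports an inserted id as absent, so every GneissWeb document is kept. It can occasionally report an id that was never inserted as present, so a small number of extra documents may pass through.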
Developers: IBM Research
Release Date: Feb 21st, 2025
License: Apache 2.0.
Testing
The Bloom filter was tested with:
Positive examples: ~10M UUIDs from 192 parquet files in GneissWeb, spanning all 96 snapshots.
Negative examples: 10,000 UUIDs from CC-MAIN-2024-51 (present in neither FineWeb 1.1.0 nor GneissWeb).
The Bloom filter returned the correct answer for every one of these examples.
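A check of this kind can be sketched as follows. This is only an illustration of the procedure, assuming any object that supports Python's `in` operator; `evaluate_membership` is a hypothetical helper, the ids are made up, and a plain `set` stands in for the filter (unlike a Bloom filter, a set never produces false positives).

```python
def evaluate_membership(bloom, positives, negatives):
    """Count false negatives and measure the false positive rate."""
    # Positives were inserted, so any miss is a false negative.
    false_negatives = sum(1 for doc_id in positives if doc_id not in bloom)
    # Negatives were never inserted, so any hit is a false positive.
    false_positives = sum(1 for doc_id in negatives if doc_id in bloom)
    rate = false_positives / len(negatives) if negatives else 0.0
    return {"false_negatives": false_negatives, "false_positive_rate": rate}


# Hypothetical ids; a set stands in for the real filter.
stand_in = {"uuid-1", "uuid-2", "uuid-3"}
report = evaluate_membership(
    stand_in,
    positives=["uuid-1", "uuid-2"],
    negatives=["uuid-8", "uuid-9"],
)
print(report)  # {'false_negatives': 0, 'false_positive_rate': 0.0}
```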