Model Summary

To make GneissWeb reproducible, we provide a Bloom filter that represents the ids of all FineWeb 1.1.0 documents that are part of GneissWeb. The filter is 28 GB in size and is an rbloom Bloom filter. Probe it with values from the id column of FineWeb 1.1.0 or of Common Crawl.

Please refer to the GneissWeb page for more details.
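
To illustrate what probing a Bloom filter with an id column means, here is a minimal sketch in pure Python. It is a toy filter written for this example, not the rbloom implementation, and it cannot read the released 28 GB file. The class name `ToyBloom`, the sizing parameters, the hashing scheme, and the sample ids and rows are all assumptions made for illustration.

```python
import hashlib
import uuid


class ToyBloom:
    """Illustrative Bloom filter. NOT rbloom and NOT its on-disk format."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical ids standing in for the ids of documents kept in a subset.
kept_ids = [str(uuid.uuid5(uuid.NAMESPACE_URL, f"kept-{i}")) for i in range(1000)]

bloom = ToyBloom(num_bits=1 << 16, num_hashes=7)
for doc_id in kept_ids:
    bloom.add(doc_id)

# Hypothetical rows of a larger dataset: some kept, some not.
rows = [{"id": kept_ids[0], "text": "kept document"},
        {"id": str(uuid.uuid5(uuid.NAMESPACE_URL, "other-0")), "text": "other"},
        {"id": kept_ids[1], "text": "another kept document"}]

# Probe the filter with each row's id and keep the rows that match.
selected = [row for row in rows if row["id"] in bloom]
```

A Bloom filter never reports an added id as absent, so every kept row survives the probe. It can occasionally report an id that was never added as present, so a selection made this way may contain a small number of extra rows.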

     Developers: IBM Research

     Release Date: Feb 21st, 2025

     License: Apache 2.0.

Testing

The Bloom filter was tested with:

   Positive examples: ~10M uuids from 192 parquet files in GneissWeb. These span all 96 snapshots.

   Negative examples: 10,000 uuids in CC-MAIN-2024-51 (not present in FineWeb 1.1.0 and also not in GneissWeb).

The Bloom filter returned the correct answer for every one of these examples.
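
A check of this kind can be sketched as follows. This is not the script used for the test above. The function name `evaluate_filter`, the sample ids, and the plain Python set standing in for the Bloom filter are assumptions made for illustration; any object that supports a membership test could be passed instead.

```python
from typing import Callable, Dict, Iterable


def evaluate_filter(
    contains: Callable[[str], bool],
    positives: Iterable[str],
    negatives: Iterable[str],
) -> Dict[str, int]:
    """Count wrong answers of a membership test on labelled ids."""
    # A positive id reported as absent is a false negative.
    false_negatives = sum(1 for doc_id in positives if not contains(doc_id))
    # A negative id reported as present is a false positive.
    false_positives = sum(1 for doc_id in negatives if contains(doc_id))
    return {
        "false_negatives": false_negatives,
        "false_positives": false_positives,
    }


# Stand-in for the real filter: a plain set with hypothetical ids.
members = {"id-1", "id-2", "id-3"}

report = evaluate_filter(
    members.__contains__,
    positives=["id-1", "id-2"],
    negatives=["id-8", "id-9"],
)
print(report)  # {'false_negatives': 0, 'false_positives': 0}
```

For a Bloom filter, zero false negatives is guaranteed by construction, while zero false positives on a sample is an empirical result that depends on the filter's size and the number of ids added.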
