Model Summary

This fastText model is used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents.

Please refer to the GneissWeb dataset page for more details.

  • Developers: IBM Research
  • Release Date: Feb 21st, 2025
  • License: Apache 2.0
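
A minimal usage sketch with the Python fasttext package, assuming the model binary is downloaded from this repository via huggingface_hub; the file name below is illustrative and should be replaced with the file actually listed in the repo.

```python
from huggingface_hub import hf_hub_download
import fasttext

# Download the fastText binary from this repository.
# NOTE: the file name is an assumption -- check the repository's file listing.
model_path = hf_hub_download(
    repo_id="ibm-granite/GneissWeb.Quality_annotator",
    filename="fasttext_gneissweb_quality_annotator.bin",
)
model = fasttext.load_model(model_path)

# fastText expects a single line of text, so replace newlines before predicting.
document = "Photosynthesis converts light energy into chemical energy ..."
labels, probabilities = model.predict(document.replace("\n", " "))
print(labels[0], probabilities[0])  # predicted quality label and its confidence
```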

Training Data

The model is trained on 400k documents, equally split between the positive (i.e., high-quality) and negative (i.e., low-quality) classes. Please refer to the fastText text classification tutorial for details on the training procedure. Training data is selected as follows; a training sketch in fastText's supervised format is shown after the list.

  • Positive documents: 190k synthetic documents randomly sampled from the Cosmopedia dataset, plus 10k documents with high educational value selected as follows: first, 600k random documents from FineWeb-V1.1.0 are annotated by asking Mixtral-8x22B-Instruct to score each document from 1 to 5 for educational quality (5 being the highest), using a prompt similar to the one used by FineWeb-Edu. Then, 10k documents are randomly selected from those with scores greater than or equal to 4.
  • Negative documents: 200k random documents out of the 600k Mixtral-annotated documents with scores less than or equal to 2.
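
A minimal training sketch following the fastText supervised-learning tutorial referenced above. The input file name, label names, and hyperparameters here are illustrative assumptions, not the exact settings used to train the released model.

```python
import fasttext

# Hypothetical input file: one document per line in fastText supervised format,
# e.g. "__label__positive <document text>" or "__label__negative <document text>".
# The label names are illustrative; the released model's labels may differ.
model = fasttext.train_supervised(
    input="gneissweb_quality_train.txt",  # the 400k labeled documents
    lr=0.1,          # illustrative hyperparameters in the style of the fastText tutorial
    epoch=5,
    wordNgrams=2,
)
model.save_model("quality_annotator.bin")
```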