
Model Summary

This fastText model is used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents.

Please refer to the GneissWeb dataset page for more details.

  • Developers: IBM Research
  • Release Date: Feb 21st, 2025
  • License: Apache 2.0
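
As a rough illustration of how a fastText classifier like this one could score a document, the sketch below wraps `model.predict` from the `fasttext` Python package. The label name `__label__hq` and the path `model.bin` are assumptions made for the example, not values taken from this model card; check `model.labels` on the loaded model for the real label strings. The score threshold and the way this score is combined with the other filters in the GneissWeb ensemble are not covered here.

```python
def high_quality_probability(model, text, positive_label="__label__hq"):
    """Return the probability the model assigns to `positive_label`.

    `positive_label` is a hypothetical name; inspect `model.labels`
    to find the labels the model was actually trained with.
    """
    # fastText predicts on a single line, so collapse newlines and
    # other runs of whitespace into single spaces first.
    line = " ".join(text.split())
    # k=-1 asks for the probabilities of all labels.
    labels, probs = model.predict(line, k=-1)
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    return 0.0


def load_and_score(model_path, text):
    """Load a fastText binary model from `model_path` and score `text`."""
    import fasttext  # imported lazily; requires `pip install fasttext`

    model = fasttext.load_model(model_path)
    return high_quality_probability(model, text)
```

A call such as `load_and_score("model.bin", document_text)` would then return a value between 0 and 1, assuming the file exists and the label name matches.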

Training Data

The model is trained on 400k documents, split equally between positive (i.e., high-quality) and negative (i.e., low-quality) classes. Please refer to the fastText text classification tutorial for details on the training procedure. The training data is selected as follows.

  • Positive documents: 190k synthetic documents randomly sampled from the Cosmopedia dataset, plus 10k documents with high educational value, selected as follows. First, 600k random documents from FineWeb-V1.1.0 are annotated by asking Mixtral-8x22B-Instruct to score each document from 1 to 5 for its educational quality (5 being the highest), using a prompt similar to the one used by FineWeb-Edu. Then, 10k documents are randomly selected from those with scores greater than or equal to 4.
  • Negative documents: 200k random documents out of the 600k Mixtral-annotated documents with scores less than or equal to 2.
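
The selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline actually used: the function names, the label names `hq` and `lq`, and the in-memory list inputs are all hypothetical, and the Mixtral scores are assumed to be already available as `(text, score)` pairs. The only fastText-specific detail is the file format: one example per line, prefixed with `__label__<name>`.

```python
import random


def to_fasttext_line(label, text):
    """Format one example as a single fastText training line."""
    # fastText expects one example per line, so collapse all whitespace.
    return f"__label__{label} " + " ".join(text.split())


def build_training_lines(synthetic_docs, scored_docs,
                         n_synthetic, n_edu, n_negative, seed=0):
    """Sample positives and negatives and return shuffled training lines.

    synthetic_docs: list of document strings (e.g. Cosmopedia samples).
    scored_docs: list of (text, score) pairs with scores from 1 to 5.
    """
    rng = random.Random(seed)
    high = [text for text, score in scored_docs if score >= 4]
    low = [text for text, score in scored_docs if score <= 2]

    # Positives: synthetic documents plus high-scoring annotated documents.
    positives = rng.sample(synthetic_docs, n_synthetic) + rng.sample(high, n_edu)
    # Negatives: low-scoring annotated documents.
    negatives = rng.sample(low, n_negative)

    lines = [to_fasttext_line("hq", text) for text in positives]
    lines += [to_fasttext_line("lq", text) for text in negatives]
    rng.shuffle(lines)
    return lines
```

With the sizes given above, the call would use `n_synthetic=190_000`, `n_edu=10_000`, and `n_negative=200_000`. The resulting lines can be written to a text file and passed to `fasttext.train_supervised(input=...)`; the hyperparameters used for this model are not stated here, so none are shown.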