README.md · ibm-granite/GneissWeb.Quality

Model Summary

This fastText model is used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents.

Please refer to the GneissWeb dataset page for more details.
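As a rough illustration of how a binary fastText quality classifier like this one might be queried, the sketch below extracts the probability of the positive class for a single document. It is not the GneissWeb ensemble filter itself. The helper name `high_quality_probability` and the label name `__label__hq` are assumptions made for illustration; inspect `model.labels` on the loaded model to find the actual label names. The `model` argument is any object with a fastText-style `predict(text, k)` method, for example the object returned by `fasttext.load_model` for a locally downloaded model file.

```python
def high_quality_probability(model, text, positive_label="__label__hq"):
    """Return the probability assigned to the positive (high-quality) label.

    `positive_label` is an assumed name, not confirmed by this model card;
    check `model.labels` on the loaded model.
    """
    # fastText's predict() handles one line at a time, so collapse
    # newlines and repeated whitespace into single spaces first.
    single_line = " ".join(text.split())
    # k=-1 asks fastText for every label together with its probability.
    labels, probs = model.predict(single_line, k=-1)
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    # The assumed label is not among the model's labels.
    return 0.0
```

How this score is thresholded or combined with the other filters in the ensemble is described on the GneissWeb dataset page, not here.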

Developers: IBM Research
Release Date: Feb 21st, 2025
License: Apache 2.0

Training Data

The model is trained on 400k documents, equally split between positive (i.e., high-quality) and negative (i.e., low-quality) classes. Please refer to the fastText text classification tutorial for details on training. The training data is selected as follows.

Positive documents: 190k synthetic documents randomly sampled from the Cosmopedia dataset, plus 10k documents with high educational value, selected as follows. First, 600k random documents from FineWeb-V1.1.0 are annotated by asking Mixtral-8x22B-Instruct to score each document from 1 to 5 for educational quality (5 being the highest), using a prompt similar to the one used by FineWeb-Edu. Then, 10k random documents are selected from those with scores greater than or equal to 4.
Negative documents: 200k random documents selected from those of the 600k Mixtral-annotated documents that scored less than or equal to 2.
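The score-based part of this selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the `(text, score)` input shape, the fixed seed, and the label names `__label__hq` and `__label__lq` are all hypothetical. The 190k Cosmopedia samples would be appended to the positives separately. The output lines follow the `__label__<name> <text>` layout that fastText expects for supervised training.

```python
import random

def select_training_docs(scored_docs, n_pos, n_neg, seed=0):
    """Pick random positives (score >= 4) and negatives (score <= 2).

    scored_docs: list of (text, score) pairs with scores from 1 to 5.
    """
    rng = random.Random(seed)
    high = [text for text, score in scored_docs if score >= 4]
    low = [text for text, score in scored_docs if score <= 2]
    # Sample without replacement, capped at the number available.
    positives = rng.sample(high, min(n_pos, len(high)))
    negatives = rng.sample(low, min(n_neg, len(low)))
    return positives, negatives

def to_fasttext_lines(positives, negatives,
                      pos_label="__label__hq", neg_label="__label__lq"):
    """Format documents as one labelled line each (label names assumed)."""
    # Each training example must fit on one line, so collapse whitespace.
    lines = [f"{pos_label} {' '.join(t.split())}" for t in positives]
    lines += [f"{neg_label} {' '.join(t.split())}" for t in negatives]
    return lines
```

With the real data, `n_pos` would be 10,000 and `n_neg` 200,000, drawn from the 600k annotated documents. Documents scoring 3 are not used by either class.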