Model Summary
This fastText model is used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents.
Please refer to the GneissWeb dataset page for more details.
- Developers: IBM Research
- Release Date: Feb 21st, 2025
- License: Apache 2.0
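As a rough illustration of how a binary fastText quality classifier like the one summarized above might be queried, the sketch below extracts the probability of the positive class for one document. It is not the GneissWeb ensemble filter itself. The label name `__label__hq` and the helper `quality_score` are hypothetical, chosen only for this example; the labels actually stored in the released model should be checked before use. The `model` argument is assumed to be an object with fastText's `predict(text, k=...)` method, such as one returned by `fasttext.load_model`.

```python
def quality_score(model, text, positive_label="__label__hq"):
    """Return the predicted probability of the positive (high-quality) label.

    `positive_label` is a hypothetical label name; inspect the real model's
    labels before relying on it.
    """
    # fastText prediction works on a single line of text, so collapse all
    # whitespace (including newlines) into single spaces first.
    clean = " ".join(text.split())
    # Ask for the top two labels so that both classes of a binary
    # classifier are returned with their probabilities.
    labels, probs = model.predict(clean, k=2)
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    # The positive label was not among the returned predictions.
    return 0.0
```

With the real library, `model` would typically come from `fasttext.load_model(path)`, where `path` points to the downloaded model file. A threshold on the returned score would then decide whether a document is kept; the appropriate threshold is not specified here.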
Training Data
The model is trained on 400k documents, split equally between positive (i.e., high-quality) and negative (i.e., low-quality) classes. Please refer to the fastText text classification tutorial for details. The training data is selected as follows.
- Positive documents: 190k synthetic documents randomly sampled from the Cosmopedia dataset, plus 10k documents with high educational value, selected as follows. First, 600k random documents from FineWeb-V1.1.0 are annotated by asking Mixtral-8x22B-Instruct to score each document from 1 to 5 for its educational quality (5 being the highest), using a prompt similar to the one used by FineWeb-Edu. Then, 10k random documents are selected from those with scores greater than or equal to 4.
- Negative documents: 200k random documents sampled from those of the 600k Mixtral-annotated documents that scored 2 or lower.
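The selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the label names `hq` and `lq`, and the input shapes (a list of `(text, score)` pairs for the Mixtral-annotated documents and a list of strings for the Cosmopedia documents) are all hypothetical. The only fixed convention is fastText's supervised input format, one document per line prefixed with `__label__<name>`.

```python
import random


def to_fasttext_line(label, text):
    """Format one document as a fastText supervised-training line."""
    # Collapse all whitespace so each document occupies exactly one line.
    return f"__label__{label} " + " ".join(text.split())


def build_training_set(annotated, synthetic, seed=0,
                       n_synthetic=190_000, n_pos_web=10_000, n_neg=200_000):
    """Assemble labeled training lines from the two document pools.

    annotated: list of (text, score) pairs, score in 1..5 (hypothetical shape)
    synthetic: list of synthetic document texts (hypothetical shape)
    """
    rng = random.Random(seed)
    # Web documents with scores >= 4 are positive candidates;
    # those with scores <= 2 are negative candidates.
    high = [text for text, score in annotated if score >= 4]
    low = [text for text, score in annotated if score <= 2]

    # random.Random.sample raises ValueError if a pool is too small.
    positives = rng.sample(synthetic, n_synthetic) + rng.sample(high, n_pos_web)
    negatives = rng.sample(low, n_neg)

    lines = [to_fasttext_line("hq", t) for t in positives]
    lines += [to_fasttext_line("lq", t) for t in negatives]
    rng.shuffle(lines)
    return lines
```

Writing the returned lines to a text file would give an input suitable for `fasttext.train_supervised(input=...)`; the hyperparameters used for the released model are not stated here, so none are suggested.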