---
license: apache-2.0
---

**Model Summary**

This fastText model is used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents. Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) dataset page for more details.

- **Developers**: IBM Research
- **Release Date**: Feb 21st, 2025
- **License**: Apache 2.0

**Training Data**

The model is trained on 400k documents, equally split between positive (i.e., high-quality) and negative (i.e., low-quality) classes. Please refer to the [fastText text classification tutorial](https://fasttext.cc/docs/en/python-module.html) for details on training fastText classifiers. Training data is selected as follows.

- *Positive documents*: 190k synthetic documents randomly sampled from the [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset, plus 10k documents with high educational value, selected as follows. First, 600k random documents from [FineWeb-V1.1.0](https://huggingface.co/datasets/HuggingFaceFW/fineweb) are annotated by asking Mixtral-8x22B-Instruct to score each document from 1 to 5 for its educational quality (with 5 being the highest quality), using a prompt similar to the one used by [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu). Then, 10k random documents are selected from the documents with scores greater than or equal to 4.
- *Negative documents*: 200k random documents selected from those of the 600k Mixtral-annotated documents with scores less than or equal to 2.
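
The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used for GneissWeb: the function names (`select_training_docs`, `to_fasttext_line`), the label names `hq` and `lq`, the fixed seed, and the toy data are all assumptions made for the example. Only the score thresholds (at least 4 for positives, at most 2 for negatives) and the 190k/10k/200k proportions come from the description.

```python
import random


def select_training_docs(cosmopedia_docs, annotated_docs,
                         n_synthetic, n_edu, n_negative, seed=0):
    """Pick positive and negative documents by random sampling.

    annotated_docs is a list of (text, score) pairs, where score is the
    1-5 educational-quality score assigned by the annotating LLM.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    high = [text for text, score in annotated_docs if score >= 4]
    low = [text for text, score in annotated_docs if score <= 2]
    positives = rng.sample(cosmopedia_docs, n_synthetic) + rng.sample(high, n_edu)
    negatives = rng.sample(low, n_negative)
    return positives, negatives


def to_fasttext_line(label, text):
    """Format one example as '__label__<name> <text>' on a single line."""
    # Collapse all whitespace so each document occupies exactly one line.
    return f"__label__{label} " + " ".join(text.split())


# Toy data with the same 19:1:20 proportions as 190k / 10k / 200k.
cosmopedia = [f"synthetic doc {i}" for i in range(50)]
annotated = [(f"web doc {i}", 1 + i % 5) for i in range(100)]

positives, negatives = select_training_docs(
    cosmopedia, annotated, n_synthetic=19, n_edu=1, n_negative=20
)

# "hq" and "lq" are hypothetical label names chosen for this sketch.
lines = [to_fasttext_line("hq", doc) for doc in positives]
lines += [to_fasttext_line("lq", doc) for doc in negatives]
```

Writing `lines` to a text file, one example per line, yields the input format that the fastText Python module's `fasttext.train_supervised(input=...)` expects. The hyperparameters used for the released model are not stated here, so none are shown.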