Model Summary
This fastText model is used as part of the ensemble filter in GneissWeb to detect and remove low-quality documents.
Please refer to the GneissWeb dataset page for more details.
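As a rough illustration of how a fastText classifier like this one might be used to score a document, here is a minimal sketch. It assumes the official `fasttext` Python bindings. The file name `model.bin`, the label name `__label__hq`, and the helper `quality_score` are hypothetical placeholders, not names confirmed by this model card; check the model's actual labels (for example via `model.labels`) before relying on them.

```python
def load_quality_model(path="model.bin"):
    """Load a fastText model from disk (the path is a hypothetical placeholder)."""
    import fasttext  # imported lazily so the helper below works without it
    return fasttext.load_model(path)


def quality_score(model, text, positive_label="__label__hq"):
    """Return the probability the model assigns to the positive label.

    `positive_label` is an assumed label name, not taken from the model card.
    """
    # fastText's predict() rejects input containing newlines,
    # so collapse all whitespace runs into single spaces first.
    flat = " ".join(text.split())
    # k=-1 asks for every label, so the positive label is always present
    # in the output if the model knows it.
    labels, probs = model.predict(flat, k=-1)
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    return 0.0  # label not found: treat as zero confidence
```

In an ensemble filter, a score like this would typically be combined with other signals rather than thresholded on its own; the combination rule is described on the GneissWeb dataset page, not here.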
Training Data
The model is trained on 400k documents, split equally between positive (i.e., high-quality) and negative (i.e., low-quality) classes. Please refer to the fastText text classification tutorial for details. Training data is selected as follows.