Model Summary

This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper Predictive Data Selection: The Data That Predicts Is the Data That Teaches. It is also the classifier we used to build the PreSelect-100B dataset with a selection threshold of 10%. The positive and negative label names are "__label__1" and "__label__0", respectively.
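
As a rough illustration of how the two labels can be read, the sketch below scores a single document with the fastText Python bindings. It assumes the `fasttext` package is installed and that the model file has been downloaded locally; the helper names `positive_score` and `score_text` are hypothetical and not part of the paper's code.

```python
from typing import Sequence

POSITIVE_LABEL = "__label__1"  # positive label name, as stated in the summary above


def positive_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability assigned to the positive label, or 0.0 if it is absent."""
    for label, prob in zip(labels, probs):
        if label == POSITIVE_LABEL:
            return float(prob)
    return 0.0


def score_text(model_path: str, text: str) -> float:
    """Score one document (hypothetical helper; model_path is a local file)."""
    import fasttext  # imported lazily; assumes `pip install fasttext`

    model = fasttext.load_model(model_path)
    # fastText predicts one line at a time, so newlines are replaced by spaces.
    # k=-1 asks for the probabilities of all labels.
    labels, probs = model.predict(text.replace("\n", " "), k=-1)
    return positive_score(labels, probs)
```

For example, `score_text("PreSelect-classifier.bin", "some document text")` would return the positive-label probability, assuming the model file is stored under that name in the working directory.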

How to use

You can refer to the paper's code repository to run the filtering directly with any fastText model, or simply use the following datatrove pipeline:

import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

# Parse the input and output locations from the command line.
parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, required=True, help="input path name")
parser.add_argument("--output_path", type=str, required=True, help="output name")
args = parser.parse_args()

# Create the output directory if it does not exist yet.
Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
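        # Read JSONL documents; the document text is taken from the "text" field.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents whose "__label__1" (positive) score reaches the 0.5 threshold.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the surviving documents as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),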
    ],
    tasks=100,
)
dist_executor.run()
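
The pipeline above keeps documents with a fixed score cutoff of 0.5, while the summary mentions a selection threshold of 10%. One possible way to relate the two, under the assumption that the 10% refers to keeping the top-scoring fraction of documents, is to compute the score cutoff from a sample of positive-label scores. The function name `top_fraction_cutoff` and the sample scores below are made up for illustration.

```python
import math
from typing import Sequence


def top_fraction_cutoff(scores: Sequence[float], fraction: float = 0.10) -> float:
    """Return the lowest score that still falls within the top `fraction` of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Keep at least one document, even for very small samples.
    keep = max(1, math.floor(len(ranked) * fraction))
    return ranked[keep - 1]


# Illustrative positive-label scores for ten documents (not real data).
scores = [0.05, 0.91, 0.40, 0.77, 0.12, 0.63, 0.08, 0.99, 0.30, 0.55]
cutoff = top_fraction_cutoff(scores, fraction=0.10)
kept = [s for s in scores if s >= cutoff]
print(cutoff, kept)
```

A cutoff obtained this way could then replace the 0.5 in `keep_labels=[("1", 0.5)]`. Whether this matches how PreSelect-100B was built is not stated here, so check the paper and its code repository before relying on it.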

Training

For more training details, refer to the paper. The training code is available on GitHub in the PreSelect repository.

Citation

If you find this work helpful, please cite it as:

@article{shum2025predictivedataselectiondata,
      title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches}, 
      author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
      journal={arXiv preprint arXiv:2503.00808},
      year={2025},
      eprint={2503.00808},
}
