Model Summary

This is a fastText-based binary classifier for identifying high-quality data in pretraining corpora, introduced in the paper "Predictive Data Selection: The Data That Predicts Is the Data That Teaches". It is also the classifier we used to build the PreSelect-100B dataset with a selection threshold of 10%. The positive and negative label names are "__label__1" and "__label__0", respectively.
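As a rough illustration of the 10% selection threshold, the sketch below keeps the highest-scoring fraction of documents given their "__label__1" probabilities. It assumes the threshold means retaining the top 10% of documents by classifier score, which is one reading of the summary above and not confirmed here. The function name `select_top_fraction` and the sample scores are hypothetical.

```python
import math
from typing import List, Sequence


def select_top_fraction(scores: Sequence[float], fraction: float = 0.10) -> List[int]:
    """Return the indices of the highest-scoring `fraction` of documents.

    `scores` is assumed to hold one positive-class ("__label__1")
    probability per document. This is an illustrative helper, not the
    selection code used to build PreSelect-100B.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Number of documents to keep, rounded up.
    keep = math.ceil(len(scores) * fraction)
    # Rank document indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in their original document order.
    return sorted(ranked[:keep])


# Hypothetical scores for ten documents.
example_scores = [0.1, 0.9, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.05]
selected = select_top_fraction(example_scores, fraction=0.10)
```

A rank-based cutoff like this depends on the score distribution of the corpus being filtered, so it is not equivalent to a fixed probability cutoff.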

How to use

To run the filtering with any fastText model, refer to the paper's code repository. Alternatively, use the following datatrove script:

import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, required=True, help="input path name")
parser.add_argument("--output_path", type=str, required=True, help="output name")
args = parser.parse_args()

# Create the output directory if it does not exist yet.
Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read JSONL documents; the document text is taken from the "text" field.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents whose "__label__1" score is at least 0.5.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the surviving documents as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
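
To score individual documents without a datatrove pipeline, the model can also be queried through the fasttext Python package. The sketch below is a minimal example under stated assumptions, not the authors' code. The helper names (`normalize_text`, `positive_score`, `score_document`) are hypothetical, and the model object is assumed to come from `fasttext.load_model("PreSelect-classifier.bin")`. Newlines are collapsed because fastText's `predict` processes one line at a time and rejects text containing newline characters.

```python
from typing import Sequence

# Positive label name, as stated in the model summary.
POSITIVE_LABEL = "__label__1"


def normalize_text(text: str) -> str:
    """Collapse all whitespace runs (including newlines) into single spaces."""
    return " ".join(text.split())


def positive_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability paired with POSITIVE_LABEL, or 0.0 if absent."""
    for label, prob in zip(labels, probs):
        if label == POSITIVE_LABEL:
            return float(prob)
    return 0.0


def score_document(model, text: str) -> float:
    """Score one document with a loaded fastText model.

    `model` is assumed to expose predict(text, k) returning a
    (labels, probabilities) pair, as fastText models do; k=-1 asks
    for every label so the positive one is always present.
    """
    labels, probs = model.predict(normalize_text(text), k=-1)
    return positive_score(labels, probs)
```

With the fasttext package installed, calling `score_document(fasttext.load_model("PreSelect-classifier.bin"), text)` would return the "__label__1" probability for `text`. That value can then be compared against a cutoff such as the 0.5 used in the script above.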

Training

For more training details, refer to the paper. The training code is available in the PreSelect repository on GitHub.
