FineData
AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Organization Card
🍷 FineData
This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.
- 🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blog post and paper.
- 📚 FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- 🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
- 📄 FinePDFs: 3T tokens of text data extracted from PDFs sourced from the web.
- 🌐 FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
- 📄 FinePDFs-Edu: 350B+ highly educational tokens filtered from 📄 FinePDFs.
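Datasets of this size are usually read in streaming mode and not downloaded in full. The sketch below shows one way to preview a few records with the `datasets` library. The repository IDs are taken from the dataset listing on this page; the `preview` helper, the `train` split, and the subset name passed as `config` are assumptions for illustration, so check each dataset card for the names it actually uses.

```python
from itertools import islice

# Repository IDs as they appear in the dataset listing on this page.
FINEDATA_REPOS = {
    "finepdfs": "HuggingFaceFW/finepdfs",
    "finepdfs-edu": "HuggingFaceFW/finepdfs-edu",
    "fineweb-2": "HuggingFaceFW/fineweb-2",
    "finewiki": "HuggingFaceFW/finewiki",
}


def preview(name, config, n=3):
    """Return the first n records of a dataset, streamed from the Hub.

    `preview` is a hypothetical helper, not part of any FineData tooling.
    """
    # Imported lazily so this module loads even where `datasets` is not installed.
    from datasets import load_dataset

    stream = load_dataset(
        FINEDATA_REPOS[name],
        name=config,      # assumed subset name; see the dataset card
        split="train",    # assumed split name
        streaming=True,   # iterate over records without a full download
    )
    return list(islice(stream, n))
```

A call such as `preview("finepdfs", "eng_Latn")` would then fetch three records, assuming the dataset exposes a subset with that name and that network access to the Hub is available.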
Spaces (7)
- 📄 FinePDFs: Liberating 3T of the finest tokens from PDFs
- 🌐 FineWiki Viewer: Viewer to explore the finewiki dataset
- 🍷 FineWeb: decanting the web for the finest text data at scale. Generate high-quality text data for LLMs using FineWeb
- 📝 Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks. Evaluate multilingual models using FineTasks
- 🏢 Tasks Explorer: Explore and analyze experiment results
Models (105)
- HuggingFaceFW/finepdfs_edu_classifier_eng_Latn (0.4B)
- HuggingFaceFW/finepdfs_dclm_classifier_eng_Latn (0.4B)
- HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn (0.4B)
- HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn (0.4B)
- HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr (0.3B)
- HuggingFaceFW/finepdfs_edu_classifier_nno_Latn (0.3B)
- HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl (0.3B)
- HuggingFaceFW/finepdfs_edu_classifier_tam_Taml (0.3B)
- HuggingFaceFW/finepdfs_edu_classifier_azj_Latn (0.3B)
- HuggingFaceFW/finepdfs_edu_classifier_afr_Latn (0.3B)
Datasets (19)
- HuggingFaceFW/finetranslations
- HuggingFaceFW/CommonsenseQA
- HuggingFaceFW/MMLU-Redux-2.0-Generative
- HuggingFaceFW/ARC-Generative
- HuggingFaceFW/finepdfs
- HuggingFaceFW/finepdfs-edu
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/finewiki
- HuggingFaceFW/clean-wikipedia
- HuggingFaceFW/finepdfs_lang_classification_tmp