HuggingFaceFW/fineweb • 52.5B • 311k • 2.32k
A collection of datasets for LLM pretraining
Note 🍷 Web datasets
Note 📚 Highly curated web datasets filtered using classifiers
Note 📐 Highly curated math pages from CommonCrawl
Note 💻 Github code dataset
Note Synthetic textbooks
Note Contains Cosmopedia v2 (synthetic textbooks) and Python-Edu (educational Python code)