datablations

https://github.com/huggingface/datablations

AI & ML interests

Scaling Data-Constrained Language Models

Recent Activity

thomwolf  authored a paper about 2 months ago
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
craffel  authored a paper about 2 months ago
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
craffel  authored a paper 3 months ago
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Team members: Thomas Wolf, Teven Le Scao, Sasha Rush, Niklas Muennighoff, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Colin Raffel, Risto Luukkonen

datablations's datasets (13)

datablations/scripts

Viewer • Updated Jun 15, 2023 • 3.48M • 2.39k

datablations/oscar-subsets

Viewer • Updated Jun 14, 2023 • 365k • 343

datablations/c4-subsets

Viewer • Updated Jun 14, 2023 • 729k • 1.06k • 5

datablations/c4-filter-megatron

Updated May 28, 2023 • 313

datablations/oscar-filter-megatron

Updated May 27, 2023 • 187

datablations/python-megatron

Updated May 22, 2023 • 1.99k • 1

datablations/subsets

Viewer • Updated May 10, 2023 • 365k • 32

datablations/oscar-filter

Viewer • Updated May 10, 2023 • 432M • 795

datablations/oscar-dedup-expanded

Viewer • Updated May 10, 2023 • 432M • 91

datablations/mup

Updated Apr 24, 2023 • 567

datablations/c4-filter

Viewer • Updated Feb 1, 2023 • 365M • 2.54k

datablations/c4-filter-small

Viewer • Updated Jan 17, 2023 • 100k • 72

datablations/oscar-filter-small

Viewer • Updated Nov 24, 2022 • 100k • 11