Guilherme Penedo
guipenedo
Recent Activity
liked
a Space
4 days ago
nanotron/ultrascale-playbook
liked
a model
7 days ago
deepseek-ai/DeepSeek-R1
new activity
12 days ago
HuggingFaceFW/fineweb: Downloading the 350BT sample uses 990GB of disk space
guipenedo's activity
Downloading the 350BT sample uses 990GB of disk space
4
#57 opened about 1 month ago
by
ddh0

Create Ffcc
1
#58 opened 13 days ago
by
Ricky23184
Update 2025/2025-01-22-Torstar.md
#4 opened 23 days ago
by
guipenedo

New update returns a 500 server error using the datasets-server API
6
#18 opened about 2 months ago
by
jonna32
Synthetic Data Generator
1
#5 opened about 1 month ago
by
kishorekashyap
Cannot load with datasets
3
#4 opened about 2 months ago
by
mbanon

A lot of load errors after new update
14
#19 opened about 2 months ago
by
yzhangcs

Add "date" column to "default" subset
#20 opened about 2 months ago
by
lhoestq

Simple exact deduplication removes 2/3 of data.
4
#49 opened 7 months ago
by
egor-pakhomov
Torrent?
3
#4 opened 10 months ago
by
emilss
Any plan to train models on larger subset of dataset?
1
#8 opened 10 months ago
by
mrfakename

Are copyrighted works included in this dataset?
4
#9 opened 10 months ago
by
umm-maybe

Reprocessing for a new language
14
#12 opened 10 months ago
by
pere

Training configs for data ablation study
2
#14 opened 10 months ago
by
jimmyhbx
tiny-fineweb
3
#19 opened 10 months ago
by
3thn

Unsafe files
1
#25 opened 10 months ago
by
alielfilali01

"Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20" using fineweb by Karpathy
#28 opened 9 months ago
by
clem

Regarding the newly updated indexes (written as deduplication issues)
5
#29 opened 9 months ago
by
kimcando

Language subset
3
#33 opened 9 months ago
by
talmor