The model could be super depressed and stressed out!
Daniel van Strien (davanstrien) PRO
AI & ML interests: Machine Learning Librarian
Recent Activity
- updated a dataset about 1 hour ago: librarian-bots/dataset_cards_with_metadata
- updated a dataset about 4 hours ago: librarian-bots/dataset-columns
- updated a dataset about 4 hours ago: data-is-better-together/fineweb-c-progress
Hope so!
Yeah, quite bold that they put health + legal use cases so prominently

reacted to clem's post with 🔥, 14 days ago:
Thread to gossip during the openai GPT-5 livestream: https://www.youtube.com/watch?v=0Uu_VJeVVfo. Feel free to post your impressions below!

Very off topic, but on the theme of music to welcome aliens, this short film is lovely: https://www.youtube.com/watch?v=Jr83bJsT6OA!

reacted to dmoxy's post with 👍, about 1 month ago:
🥐 Got a Croissant URL?
Here is the fastest way to ingest it into a high-performance multimodal database.
Workflow #1: Croissant Ingestion is live on ApertureDB Cloud — the first release in our Summer Workflows series.
Plug in any MLCommons Croissant-formatted dataset and this ready-to-run workflow will:
✅ Parse Croissant metadata
📥 Download all linked assets (images, text, video, etc.)
📦 Ingest them into ApertureDB, preserving structure and relationships
All with just a few lines of Python.
Whether you are working with public datasets from Hugging Face or prepping production-ready data pipelines — this is the ingestion flow you’ve been waiting for.
🎥 Watch a quick demo→ https://www.youtube.com/watch?v=6cWcZ2G53gE
🔗 Try it → https://cloud.aperturedata.io/signup?campaign=WF1Croissant
📚 Docs → https://docs.aperturedata.io/workflows/ingest_from_croissant
💬 Tell us what you think about this workflow and what you'd like to see us build next. Hit the comments with your ideas — we're listening!
We are launching one new workflow every Wednesday — 12 in total.
Follow along all summer☀️.
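For readers curious what "parse Croissant metadata" amounts to: Croissant files are JSON-LD, so listing a dataset's assets is plain JSON traversal. Below is a minimal sketch over a made-up inline record (field names follow the Croissant spec's `distribution`/`contentUrl` convention); it illustrates the metadata-parsing step only and is not the ApertureDB workflow itself.

```python
import json

# A tiny, hypothetical Croissant-style JSON-LD record. Real Croissant files
# are much richer; see the MLCommons Croissant specification.
croissant_json = """
{
  "@type": "sc:Dataset",
  "name": "toy-dataset",
  "distribution": [
    {"@type": "cr:FileObject", "name": "images.zip",
     "contentUrl": "https://example.org/images.zip",
     "encodingFormat": "application/zip"},
    {"@type": "cr:FileObject", "name": "labels.csv",
     "contentUrl": "https://example.org/labels.csv",
     "encodingFormat": "text/csv"}
  ]
}
"""

def list_assets(metadata: dict) -> list[tuple[str, str]]:
    """Return (name, contentUrl) for every file in the distribution list."""
    return [(f["name"], f["contentUrl"]) for f in metadata.get("distribution", [])]

metadata = json.loads(croissant_json)
for name, url in list_assets(metadata):
    print(name, "->", url)
```

A real ingestion step would then download each `contentUrl` and write the records into the database, preserving the relationships encoded in the metadata.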

posted an update, 2 months ago:
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.
Key capabilities:
- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.
Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)
https://github.com/davanstrien/hub-semantic-search-mcp
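To make the retrieval machinery concrete: semantic search ranks items by vector similarity between a query embedding and document embeddings. The toy sketch below uses bag-of-words vectors purely to show the ranking mechanics; the "semantic" part comes from swapping the toy `embed()` for a learned embedding model, which is what the actual server does behind its API. All names here are illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real semantic search
    # system would use a learned dense embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical dataset cards standing in for Hub metadata.
cards = {
    "math-reasoning": "step by step math reasoning traces",
    "legal-qa": "question answering over legal documents",
    "code-search": "semantic code retrieval corpus",
}

def search(query: str, top_k: int = 2) -> list[str]:
    """Rank cards by similarity to the query and return the top_k names."""
    q = embed(query)
    ranked = sorted(cards, key=lambda k: cosine(q, embed(cards[k])), reverse=True)
    return ranked[:top_k]

print(search("reasoning about math problems"))
```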

reacted to cbensimon's post with 🔥, 3 months ago:
🚀 ZeroGPU medium size is now available as a power-user feature
Nothing too fancy for now—ZeroGPU Spaces still default to large (70GB VRAM)—but this paves the way for:
- 💰 size-based quotas / pricing (medium will offer significantly more usage than large)
- 🦣 the upcoming xlarge size (141GB VRAM)
You can as of now control GPU size via a Space variable. Accepted values:
- auto (future default)
- medium
- large (current default)
The auto mode checks total CUDA tensor size during startup:
- More than 30GB → large
- Otherwise → medium
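The auto rule is a single threshold, so it can be sketched in a few lines. This is an illustration of the described selection logic only, not the actual ZeroGPU code:

```python
def pick_gpu_size(total_cuda_tensor_gb: float) -> str:
    """Mimic the described `auto` rule: more than 30GB of CUDA tensors
    at startup selects `large`, anything else selects `medium`."""
    return "large" if total_cuda_tensor_gb > 30 else "medium"

print(pick_gpu_size(70.0))  # a 70GB checkpoint lands on "large"
print(pick_gpu_size(12.5))  # a small model lands on "medium"
```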

posted an update, 4 months ago:
Came across a very nice submission from
@marcodsn
for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).
The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:
- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model
It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.
I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.
Dataset can be found here: marcodsn/academic-chains (give it a like!)

reacted to jasoncorkill's post with 🔥, 4 months ago:
🔥 Yesterday was a fire day!
We dropped two brand-new datasets capturing Human Preferences for text-to-video and text-to-image generations powered by our own crowdsourcing tool!
Whether you're working on model evaluation, alignment, or fine-tuning, this is for you.
1. Text-to-Video Dataset (Pika 2.2 model):
Rapidata/text-2-video-human-preferences-pika2.2
2. Text-to-Image Dataset (Reve-AI Halfmoon):
Rapidata/Reve-AI-Halfmoon_t2i_human_preference
Let’s train AI on AI-generated content with humans in the loop.
Let’s make generative models that actually get us.

reacted to ajibawa-2023's post with 🔥, 4 months ago:
Hi all, I recently released two audio datasets generated from my earlier dataset ajibawa-2023/Children-Stories-Collection:
First Audio Dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection-Large has 5,600+ stories in .mp3 format.
Second Audio Dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection has 600 stories in .mp3 format.

reacted to jasoncorkill's post with 🚀🔥, 4 months ago:
🚀 We tried something new!
We just published a dataset using a new (for us) preference modality: direct ranking based on aesthetic preference. We ranked a couple of thousand images from most to least preferred, all sampled from the Open Image Preferences v1 dataset by the amazing @data-is-better-together team.
📊 Check it out here:
Rapidata/2k-ranked-images-open-image-preferences-v1
We're really curious to hear your thoughts!
Is this kind of ranking interesting or useful to you? Let us know! 💬
If it is, please consider leaving a ❤️ and if we hit 30 ❤️s, we’ll go ahead and rank the full 17k image dataset!

replied to jasoncorkill's post, 4 months ago:
This is very cool! I was always curious about doing something like this! It could be quite cool to train an "aesthetic preference model" on this kind of dataset, and to try using it as a reward model for image-gen training...
cc @sayakpaul @multimodalart @linoyts @davidberenstein1957 who might also find this data interesting :)

reacted to jasoncorkill's post with ❤️, 4 months ago:
🚀 We tried something new!
We just published a dataset using a new (for us) preference modality: direct ranking based on aesthetic preference. We ranked a couple of thousand images from most to least preferred, all sampled from the Open Image Preferences v1 dataset by the amazing @data-is-better-together team.
📊 Check it out here:
Rapidata/2k-ranked-images-open-image-preferences-v1
We're really curious to hear your thoughts!
Is this kind of ranking interesting or useful to you? Let us know! 💬
If it is, please consider leaving a ❤️ and if we hit 30 ❤️s, we’ll go ahead and rank the full 17k image dataset!

posted an update, 4 months ago:
I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.
- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions
My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.
This significantly reduces computation costs while expanding reasoning dataset domain coverage.
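The two-stage workflow (cheap filter first, expensive LLM second) can be sketched as follows. The keyword heuristic below is a purely hypothetical stand-in for the fine-tuned ModernBERT classifier; only the pipeline shape is the point.

```python
def reasoning_score(text: str) -> int:
    """Hypothetical stand-in for the small classifier: a keyword heuristic
    mapped onto the post's 0-4 reasoning-complexity scale. The real setup
    uses a fine-tuned ModernBERT model instead."""
    cues = ("because", "therefore", "prove", "derive", "trade-off")
    return min(4, sum(cue in text.lower() for cue in cues) * 2)

docs = [
    "The capital of France is Paris.",
    "We must derive the bound because the naive estimate fails; therefore ...",
]

# Stage 1: cheap filter over everything. Stage 2 (expensive LLM generation)
# only ever sees documents at or above the threshold.
high_value = [d for d in docs if reasoning_score(d) >= 2]
print(len(high_value))
```

Because the filter is orders of magnitude cheaper than an LLM call, most of the corpus never reaches stage 2, which is where the cost savings come from.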

posted an update, 6 months ago:
📊 Introducing "Hugging Face Dataset Spotlight" 📊
I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!
This first episode explores mathematical reasoning datasets:
- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.
Plus a bonus segment on bespokelabs/bespoke-manim!
https://www.youtube.com/watch?v=-TgmRq45tW4

reacted to stefan-it's post with 🔥, 6 months ago:
After running some 3DMark and FurMark benchmarks on Windows to make sure that my new 5090 is not melting its cables [1], and taking some nice shots with a thermal camera (I don't think that's too much), running fine-tuning experiments with my favorite Flair & Transformers libraries turns out to be very easy.
Important steps:
A good idea is to start with a fresh Ubuntu 24.04 installation with the latest CUDA 12.8 and the open NVIDIA driver, following the advice from [2]:
sudo apt -y install cuda-toolkit-12-8 nvidia-open
I tried updating from an existing Ubuntu installation with an older CUDA and driver version, and it resulted in a non-bootable system.
If you are using PyTorch 2.6 built with CUDA 12.6, it will result in:
NVIDIA Graphics Device with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
But no worries! For PyTorch, you just need a nightly 2.7 version built with CUDA 12.8. This can easily be done via:
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
After that the latest Flair version can be installed and fine-tuning will work!
References:
[1]: https://www.reddit.com/r/nvidia/comments/1inpox7/rtx_50_series_12vhpwr_megathread/
[2]: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=24.04&target_type=deb_network
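The error message above comes from the wheel's compiled architecture list not covering the 5090's compute capability (sm_120, i.e. 12.0). A pure-Python illustration of that compatibility check follows; in a real environment the device tuple would come from `torch.cuda.get_device_capability()` rather than being hard-coded.

```python
# Architectures baked into the stable PyTorch 2.6 / CUDA 12.6 wheel,
# as quoted in the error message above.
WHEEL_ARCHES = {"sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"}

def is_supported(major: int, minor: int) -> bool:
    """Check whether a device compute capability matches the wheel's list."""
    return f"sm_{major}{minor}" in WHEEL_ARCHES

print(is_supported(9, 0))   # Hopper-class capability: supported
print(is_supported(12, 0))  # RTX 5090 (sm_120): needs the cu128 nightly
```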

posted an update, 6 months ago:
Quick POC: Turn a Hugging Face dataset card into a short podcast introducing the dataset using all open models.
I think I'm the only weirdo who would enjoy listening to something like this though 😅
Here is an example for eth-nlped/stepverify