GLiNER-PII Collection PII detection models developed in collaboration with Wordcab • 5 items • Updated Sep 24, 2025 • 21
NeMo Curator - Classifier Models Collection Classifier models that can be used in NeMo Curator for labelling/filtering datasets. • 11 items • Updated 8 days ago • 24
Tiny Language Model Datasets Collection Collection of Synthetic Datasets that can be used in pretraining of any the Tiny Language Model • 14 items • Updated Sep 21, 2025 • 29
view article Article Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers +5 Sep 11, 2025 • 176
GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface Paper • 2507.18546 • Published Jul 24, 2025 • 30
SauerkrautLM-Multilingual-(Reason)-ColBERT Collection SauerkrautLM ColBERT is a suite of Late-Interaction retrieval models built with PyLate’s ColBERT architecture and tuned for seven European languages. • 7 items • Updated Aug 3, 2025 • 20
view article Article Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face +3 Jul 29, 2025 • 206
EuroBERT Collection Scaling Multilingual Encoders for European Languages • 4 items • Updated Mar 10, 2025 • 13
OpenReasoning-Nemotron Collection Collection of models for OpenReasoning-Nemotron which are trained on 5M reasoning traces for Math, Code and Science. • 6 items • Updated 8 days ago • 46
view article Article FineWeb-C: A Community-Driven Dataset for Educational Quality Annotations in 122 Languages Jul 8, 2025 • 33
ModernBERT Collection Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated Dec 19, 2024 • 151
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents Paper • 2501.08828 • Published Jan 15, 2025 • 30
Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper • 2406.20094 • Published Jun 28, 2024 • 104
view article Article How to build a custom text classifier without days of human labeling Oct 17, 2024 • 56
Unifying Multimodal Retrieval via Document Screenshot Embedding Paper • 2406.11251 • Published Jun 17, 2024 • 11
view article Article Docmatix - a huge dataset for Document Visual Question Answering Jul 18, 2024 • 77