Data and other things
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper
• 2412.14475
• Published
• 57
How to Synthesize Text Data without Model Collapse?
Paper
• 2412.14689
• Published
• 53
Token-Budget-Aware LLM Reasoning
Paper
• 2412.18547
• Published
• 46
WavePulse: Real-time Content Analytics of Radio Livestreams
Paper
• 2412.17998
• Published
• 11
Bridging the Data Provenance Gap Across Text, Speech and Video
Paper
• 2412.17847
• Published
• 11
No More Adam: Learning Rate Scaling at Initialization is All You Need
Paper
• 2412.11768
• Published
• 43
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper
• 2501.00958
• Published
• 109
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Paper
• 2501.04686
• Published
• 53
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper
• 2501.00192
• Published
• 31
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Paper
• 2501.09751
• Published
• 46
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Paper
• 2501.18511
• Published
• 20
LIMO: Less is More for Reasoning
Paper
• 2502.03387
• Published
• 62
Scaling Pre-training to One Hundred Billion Data for Vision Language Models
Paper
• 2502.07617
• Published
• 29
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Paper
• 2502.05003
• Published
• 44
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Paper
• 2502.07870
• Published
• 45
Jailbreaking to Jailbreak
Paper
• 2502.09638
• Published
• 6
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Paper
• 2502.14846
• Published
• 14
Paper
• 2503.08507
• Published
• 7
"Principal Components" Enable A New Language of Images
Paper
• 2503.08685
• Published
• 12
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Paper
• 2503.08638
• Published
• 72
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Paper
• 2503.07920
• Published
• 101
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
Paper
• 2503.24379
• Published
• 76
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Paper
• 2504.00072
• Published
• 6
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Paper
• 2504.01990
• Published
• 303
URECA: Unique Region Caption Anything
Paper
• 2504.05305
• Published
• 35
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Paper
• 2504.09081
• Published
• 16
BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation
Paper
• 2504.14538
• Published
• 30
Towards Understanding Camera Motions in Any Video
Paper
• 2504.15376
• Published
• 155
Alchemist: Turning Public Text-to-Image Data into Generative Gold
Paper
• 2505.19297
• Published
• 84
PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models
Paper
• 2505.22523
• Published
• 7
Large Language Models for Data Synthesis
Paper
• 2505.14752
• Published
• 49
HardTests: Synthesizing High-Quality Test Cases for LLM Coding
Paper
• 2505.24098
• Published
• 43
OpenThoughts: Data Recipes for Reasoning Models
Paper
• 2506.04178
• Published
• 53
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Paper
• 2506.02096
• Published
• 52
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL
Paper
• 2506.02338
• Published
• 5
Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery
Paper
• 2506.05673
• Published
• 10
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
Paper
• 2506.08300
• Published
• 9
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Paper
• 2506.10857
• Published
• 30
Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Paper
• 2506.10952
• Published
• 22
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
Paper
• 2506.18095
• Published
• 66
Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
Paper
• 2506.19290
• Published
• 53
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining
Paper
• 2507.14119
• Published
• 60
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
Paper
• 2507.16812
• Published
• 63
PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation
Paper
• 2507.16116
• Published
• 13
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
Paper
• 2507.21033
• Published
• 23
HPSv3: Towards Wide-Spectrum Human Preference Score
Paper
• 2508.03789
• Published
• 20
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
Paper
• 2508.09987
• Published
• 25
Open Data Synthesis For Deep Research
Paper
• 2509.00375
• Published
• 72
IntrEx: A Dataset for Modeling Engagement in Educational Conversations
Paper
• 2509.06652
• Published
• 26
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Paper
• 2509.09676
• Published
• 35
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Paper
• 2509.15221
• Published
• 111
AutoIntent: AutoML for Text Classification
Paper
• 2509.21138
• Published
• 36
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
Paper
• 2509.24900
• Published
• 53
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
Paper
• 2510.06499
• Published
• 33
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Paper
• 2510.13795
• Published
• 59
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Paper
• 2510.15742
• Published
• 51
FineVision: Open Data Is All You Need
Paper
• 2510.17269
• Published
• 75
UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset
Paper
• 2510.20661
• Published
• 15
DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation
Paper
• 2511.06307
• Published
• 53
CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis
Paper
• 2511.07790
• Published
• 3
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
Paper
• 2511.11002
• Published
• 4
CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios
Paper
• 2511.11168
• Published
• 2
UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
Paper
• 2512.02790
• Published
• 7
EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
Paper
• 2512.06065
• Published
• 29
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper
• 2512.14051
• Published
• 46
VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs
Paper
• 2512.12072
• Published
• 17
Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
Paper
• 2512.16905
• Published
• 32
Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
Paper
• 2512.24160
• Published
• 3
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
Paper
• 2601.17058
• Published
• 189
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
Paper
• 2602.05400
• Published
• 343
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Paper
• 2602.08234
• Published
• 68
Code2World: A GUI World Model via Renderable Code Generation
Paper
• 2602.09856
• Published
• 198
Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning
Paper
• 2602.09439
• Published
• 13
Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Paper
• 2602.10388
• Published
• 238
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Paper
• 2602.12705
• Published
• 64
On Data Engineering for Scaling LLM Terminal Capabilities
Paper
• 2602.21193
• Published
• 90
The Trinity of Consistency as a Defining Principle for General World Models
Paper
• 2602.23152
• Published
• 194