ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks • Paper • 2508.15804 • Published 25 days ago • 14
PaperBench: Evaluating AI's Ability to Replicate AI Research • Paper • 2504.01848 • Published Apr 2 • 37
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback • Paper • 2503.22230 • Published Mar 28 • 46