DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails Paper • 2502.05163 • Published 16 days ago • 21
PILAF: Optimal Human Preference Sampling for Reward Modeling Paper • 2502.04270 • Published 17 days ago • 11
Teaching Large Language Models to Reason with Reinforcement Learning Paper • 2403.04642 • Published Mar 7, 2024 • 46
A Tale of Tails: Model Collapse as a Change of Scaling Laws Paper • 2402.07043 • Published Feb 10, 2024 • 15