Abstract
A unified policy gradient estimator and a Hybrid Post-Training (HPT) algorithm effectively combine online and offline data for post-training language models, improving performance across various benchmarks.
Two major sources of training data exist for post-training modern language models: online data (model-generated rollouts) and offline data (human or other-model demonstrations). These two types of data are typically used by Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator and show that the gradients of a wide spectrum of post-training approaches arise as the gradient of a common objective under different data distribution assumptions and bias-variance tradeoffs. The gradient estimator is constructed from four interchangeable parts: a stabilization mask, a reference policy denominator, an advantage estimate, and a likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects between training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and of HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
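As a rough illustration of how the four interchangeable parts fit together, the sketch below writes the estimator as a per-token surrogate loss and adds a toy HPT-style switch. The specific choices here (advantage fixed to 1 and the reference denominator cancelled to recover SFT, an importance ratio with a clip-style mask for the RL case, and gating on rollout reward) are simplifying assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch (illustrative assumptions, not the authors' exact implementation) of the
# unified per-token gradient  mask * (1 / pi_ref) * advantage * grad(pi_theta),
# written as a surrogate loss, plus a toy HPT-style switch between SFT and RL signals.
import torch

def unified_pg_loss(logp, ref_logp, advantage, mask):
    """Surrogate loss whose gradient is mask * (pi_theta / pi_ref) * A * grad(log pi_theta)."""
    ratio = torch.exp(logp - ref_logp).detach()       # pi_theta / pi_ref, treated as a fixed weight
    per_token = -mask * ratio * advantage * logp      # gradient flows only through logp
    return per_token.sum() / mask.sum().clamp(min=1.0)

def sft_loss(logp_demo):
    """SFT as a special case: mask = 1, advantage = 1, pi_ref = pi_theta (ratio = 1)."""
    ones = torch.ones_like(logp_demo)
    return unified_pg_loss(logp_demo, logp_demo, ones, ones)   # reduces to token-level NLL

def rl_loss(logp, old_logp, advantage, eps=0.2):
    """PPO/GRPO-flavoured case: pi_ref = old policy, a clip-style trust region as the mask."""
    ratio = torch.exp(logp - old_logp)
    in_trust_region = ((ratio > 1 - eps) & (ratio < 1 + eps)).float()
    return unified_pg_loss(logp, old_logp, advantage, in_trust_region)

def hpt_step(rollout_reward, logp_rollout, old_logp, advantage, logp_demo, threshold=0.5):
    """Toy gate (assumed for illustration): exploit demonstrations via SFT when the
    model's own rollouts score poorly; otherwise explore with the RL signal."""
    if rollout_reward < threshold:
        return sft_loss(logp_demo)
    return rl_loss(logp_rollout, old_logp, advantage)
```

In this view, SFT and the RL update differ only in which data and which component choices feed the same estimator.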
Community
Do SFT and HPT use the same amount of computation and data in the benchmark results? Or is your finding that there shouldn't be significant differences given the same computation and data, since they are the same optimization process?
Thanks for your question and your interest in our work. HPT integrates SFT and RL, and we show that the SFT and RL objectives can be optimized jointly within a single loss.
As discussed in Section 3.3, while all algorithms share the same Common Objective, bias-variance trade-offs still exist across existing instantiations of the different components of the unified gradient estimator. Accordingly, we do not claim that there will be no significant differences across algorithms given the same compute and data; meaningful differences can certainly arise.
We follow the setup of our main baseline, LUFFY (arXiv:2504.14945): for SFT we train for 3 epochs on ~46k examples (≈138k example passes). For HPT, we run 500 optimization steps with a batch size of 128, totaling ~64k examples. Because HPT dynamically switches between SFT and RL, its training budget does not exceed that of the RL configuration (which also uses 500 steps). The apparent gap relative to the SFT setting is essentially the inherent RL-vs-SFT compute difference; given their distinct learning dynamics, it is not customary to compare their compute budgets head-to-head.
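For concreteness, the quoted budgets work out roughly as follows (using the same figures as above):

```python
# Back-of-the-envelope comparison of the example passes quoted above.
sft_passes = 3 * 46_000   # 3 epochs over ~46k demonstrations -> ~138,000 example passes
hpt_passes = 500 * 128    # 500 optimization steps at batch size 128 -> 64,000 examples
print(sft_passes, hpt_passes)  # 138000 64000
```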
We hope this clarifies our intent and settings, and we’re happy to share more details or additional ablations if helpful.
Nice paper! @XingtaiHF Feel free to claim it with your HF account by clicking your name on the paper page!