Abstract
Language model pretraining with next-token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple yet effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude 3.5 Sonnet by a large margin (up to +550%).
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling (2025)
- Reinforcement Learning Enhanced LLMs: A Survey (2024)
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (2025)
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning (2024)
- Diving into Self-Evolving Training for Multimodal Reasoning (2024)
- o1-Coder: an o1 Replication for Coding (2024)
- Mars-PO: Multi-Agent Reasoning System Preference Optimization (2024)
No open weights??
If they exist, please share a link.
Missing Citations: Kimi K1.5 Training Loss Function Mirrors SPPO/GPO Loss
I’ve been reviewing the training loss function presented in the Kimi K1.5 paper and have serious concerns regarding its originality. It appears that the loss function is nearly identical in form to the one introduced in the SPPO/GPO work. Specifically, both formulations involve a squared error between a log-probability ratio and a scaled reward (or preference score) term, along with a normalization factor. Here’s a side-by-side comparison:
GPO/SPPO Loss (excerpt from the GPO paper, https://arxiv.org/abs/2410.02197, and the SPPO paper, https://arxiv.org/abs/2405.00675)
Kimi K1.5 Loss (excerpt from the Kimi K1.5 paper)
While the notations differ slightly (e.g., $r(x,y,y^*)$ vs. $\widehat{s}\left(\mathbf{y} \succ \pi_{\theta_t} \mid \mathbf{x}\right)$, and $\tau$ vs. $1/\beta$), the structural similarity is striking. Both losses adjust the policy by matching a log-probability ratio to a reward (or preference score) signal, with a normalization constant to stabilize training.
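For concreteness, here is a schematic paraphrase of the two objectives as I read them; the notation is simplified by me ($\eta$, $c$, $\tau$, and $Z$ stand in for the scaling and normalization terms, and $\pi_{\theta_t}$ / $\pi_{\theta_i}$ for the respective reference policies), so please consult the linked papers and the K1.5 manuscript for the exact published forms:

$$\mathcal{L}_{\text{SPPO/GPO}}(\theta) \;\approx\; \mathbb{E}_{x,\; y \sim \pi_{\theta_t}}\!\left[\left(\log\frac{\pi_\theta(y \mid x)}{\pi_{\theta_t}(y \mid x)} \;-\; \eta\left(\widehat{s}\left(y \succ \pi_{\theta_t} \mid x\right) - c\right)\right)^{2}\right]$$

$$\mathcal{L}_{\text{K1.5}}(\theta) \;\approx\; \mathbb{E}_{(x,\,y^{*}),\; y \sim \pi_{\theta_i}}\!\left[\left(r(x, y, y^{*}) \;-\; \tau \log Z \;-\; \tau \log\frac{\pi_\theta(y \mid x)}{\pi_{\theta_i}(y \mid x)}\right)^{2}\right]$$

Written this way, both amount to regressing a log-probability ratio onto a scaled reward or preference signal with a normalizing constant, which is the structural overlap the questions below concern.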
Questions/Concerns:
Justification for Differences:
If there are intended differences (e.g., in reward computation or sampling strategy), could the authors clearly delineate these differences?
Proper Attribution and Citation:
The training loss function in Kimi K1.5 appears to directly mirror that of the SPPO/GPO work. Could the authors update the manuscript to include explicit citations to the original SPPO/GPO papers? Proper attribution is crucial to maintain academic integrity and give due credit for prior work.
Request for Clarification:
This issue is raised to ensure academic integrity and clarity on the contributions of the Kimi K1.5 paper. A prompt explanation or pointer to supplementary material addressing these concerns, including a revision of citations, would be greatly appreciated.
Thank you for your attention to this matter.