arxiv:2501.12599

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Published on Jan 22
Submitted by akhaliq on Jan 23
#2 Paper of the day

Abstract

Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH 500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).

Community

No open weights?
If they exist, please share a link.

Missing Citations: Kimi K1.5 Training Loss Function Mirrors SPPO/GPO Loss

I’ve been reviewing the training loss function presented in the Kimi K1.5 paper and have serious concerns regarding its originality. It appears that the loss function is nearly identical in form to the one introduced in the SPPO/GPO work. Specifically, both formulations involve a squared error between a log-probability ratio and a scaled reward (or preference score) term, along with a normalization factor. Here’s a side-by-side comparison:

GPO/SPPO Loss

(Excerpt from GPO Paper (https://arxiv.org/abs/2410.02197) and SPPO Paper (https://arxiv.org/abs/2405.00675))

$$\theta_{t+1} = \arg\min_{\theta} \mathbb{E}_{\mathbf{x} \sim \mathcal{X},\, \mathbf{y} \sim \pi_{\theta_t}(\cdot \mid \mathbf{x})} \left[ \left( \log \frac{\pi_{\theta}(\mathbf{y} \mid \mathbf{x})}{\pi_{\theta_t}(\mathbf{y} \mid \mathbf{x})} - \frac{1}{\beta} \Big( \widehat{s}\left(\mathbf{y} \succ \pi_{\theta_t} \mid \mathbf{x}\right) - \log Z_{\pi_{\theta_t}}(\mathbf{x}) \Big) \right)^2 \right]$$

Kimi K1.5 Loss

(Excerpt from the Kimi K1.5 Paper)

$$L(\theta) = \mathbb{E}_{(x, y^*) \sim \mathcal{D}}\left[\mathbb{E}_{(y, z) \sim \pi_{\theta_i}}\left[\left(r\left(x, y, y^*\right) - \tau \log Z - \tau \log \frac{\pi_\theta(y, z \mid x)}{\pi_{\theta_i}(y, z \mid x)}\right)^2\right]\right]$$

While the notation differs slightly (e.g., $r(x,y,y^*)$ vs. $\widehat{s}\left(\mathbf{y} \succ \pi_{\theta_t} \mid \mathbf{x}\right)$, and $\tau$ vs. $1/\beta$), the structural similarity is striking. Both losses adjust the policy by matching a log-probability ratio to a reward (or preference score) signal, with a normalization constant to stabilize training.
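To make the comparison concrete, here is a minimal PyTorch sketch of both objectives exactly as written above. This is my own illustration, not code from either paper's release; the tensor names (`log_ratio`, `pref_score`, `reward`, `log_Z`) are placeholders for quantities that would be estimated elsewhere during training.

```python
# Illustrative sketch only -- not taken from the Kimi k1.5 or SPPO/GPO codebases.
# Inputs are per-sample tensors computed elsewhere:
#   log_ratio  : log pi_theta(y|x) - log pi_theta_t(y|x)  (differentiable w.r.t. theta)
#   pref_score : s_hat(y > pi_theta_t | x)                (SPPO/GPO preference estimate)
#   reward     : r(x, y, y*)                              (Kimi k1.5 reward)
#   log_Z      : estimate of the log normalizer
import torch


def sppo_gpo_loss(log_ratio, pref_score, log_Z, beta):
    """Squared error between the log-ratio and (1/beta) * (preference - log Z)."""
    target = (pref_score - log_Z) / beta
    return ((log_ratio - target) ** 2).mean()


def kimi_k15_loss(log_ratio, reward, log_Z, tau):
    """Squared error of the residual (reward - tau * log Z - tau * log-ratio)."""
    residual = reward - tau * log_Z - tau * log_ratio
    return (residual ** 2).mean()
```

Up to sign and whether the temperature scales the reward/preference term or the log-ratio term (with $\tau = 1/\beta$), both residuals have the same shape: a log-probability ratio matched against a reward-like signal minus a log-normalizer, which is the structural similarity described above.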


Questions/Concerns:

  1. Justification for Differences:
    If there are intended differences (e.g., in reward computation or sampling strategy), could the authors clearly delineate these differences?

  2. Proper Attribution and Citation:
    The training loss function in Kimi K1.5 appears to directly mirror that of the SPPO/GPO work. Could the authors update the manuscript to include explicit citations to the original SPPO/GPO papers? Proper attribution is crucial to maintain academic integrity and give due credit for prior work.


Request for Clarification:

This issue is raised to ensure academic integrity and clarity on the contributions of the Kimi K1.5 paper. A prompt explanation or pointer to supplementary material addressing these concerns, including a revision of citations, would be greatly appreciated.

Thank you for your attention to this matter.
