---
base_model:
- Qwen/Qwen2.5-Math-1.5B
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- reasoning
- Zero-RL
---

# 📖Introduction

[![Github](https://img.shields.io/badge/LUFFY-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/ElliottYan/LUFFY)

LUFFY is a reinforcement learning framework that bridges the gap between zero-RL and imitation learning by incorporating off-policy reasoning traces into the training process. Built upon GRPO, LUFFY combines on-policy rollouts with off-policy demonstrations during advantage estimation and introduces **policy shaping** via regularized importance sampling to emphasize low-probability yet crucial actions.

### Key Highlights:
- **Off-Policy Guidance:** Seamlessly integrates external reasoning traces to bootstrap learning from stronger models.
- **Dynamic Balance:** Learns when to imitate and when to explore, adapting over the course of training.
- **Policy Shaping:** Emphasizes important actions often ignored in standard policy gradients, enabling better generalization.

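The full training objective is given in the paper; as a rough illustration only, the sketch below shows how a group-relative (GRPO-style) advantage over a mixed group of on-policy rollouts and off-policy demonstrations could be combined with a regularized importance weight that keeps low-probability tokens from being ignored. The shaping function `f(p) = p / (p + gamma)`, the value of `gamma`, and all variable names here are illustrative assumptions, not the released training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within one mixed group of rollouts.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def shaped_weight(logp_policy: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    # Regularized importance weight f(p) = p / (p + gamma).
    # Unlike a raw probability/ratio, its gradient does not vanish for
    # low-probability tokens, which is the "policy shaping" intuition above.
    p = torch.exp(logp_policy)
    return p / (p + gamma)

# Toy mixed group: 4 on-policy rollouts plus 2 off-policy demonstrations,
# all scored by the same verifiable reward (1 = correct, 0 = incorrect).
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])
adv = grpo_advantages(rewards)

# Token-level log-probs of one off-policy demonstration under the current policy
# (placeholder values; in training these come from the model's forward pass).
logp_policy = torch.tensor([-2.3, -0.1, -4.0], requires_grad=True)
w = shaped_weight(logp_policy)

# Policy-gradient surrogate for that demonstration's tokens.
loss = -(w * adv[4]).mean()
loss.backward()
print(w.detach(), logp_policy.grad)
```

The effect of the shaping transform shows up in the gradient: with a raw weight, the contribution of a token scales with its probability and vanishes for rare but crucial tokens, whereas `p / (p + gamma)` keeps those tokens contributing to the update.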
---

## Inference

Here’s an example of using LUFFY for inference:


```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Elliott/LUFFY-Qwen-Math-7B-Zero"

question = "which number is larger? 9.11 or 9.9?"

# Format the question with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{"role": "user", "content": question}]
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate with vLLM.
llm = LLM(model=model_path)
params = SamplingParams(temperature=0.6, max_tokens=8192)
outputs = llm.generate([chat], params)
print(outputs[0].outputs[0].text)
```

---

# 📃Evaluation

| **Model** | **AIME 24** | **AIME 25** | **AMC** | **MATH-500** | **Minerva** | **Olympiad** | **Avg.** |
|-------|---------|---------|-----|----------|---------|----------|------|
| Qwen2.5-Math-1.5B-Base | 7.9 | 4.7 | 26.4 | 31.0 | 12.1 | 21.5 | 17.3 |
| Qwen2.5-Math-1.5B-Instruct | 11.4 | 8.5 | 47.4 | 75.2 | 27.6 | 38.7 | 34.8 |
| SFT | 15.2 | **14.3** | 43.5 | 74.8 | **30.9** | 36.9 | 40.3 |
| On-Policy RL | 12.6 | 6.5 | 42.6 | 68.8 | 22.1 | 34.4 | 36.1 |
| **LUFFY-1.5B-Zero** | **15.2** | 12.7 | **46.8** | **79.4** | 26.5 | **42.4** | **42.1** |

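The scores above come from the paper's evaluation pipeline. As a small, self-contained illustration, generated answers can be checked for mathematical equivalence with [Math-Verify](https://github.com/huggingface/Math-Verify) (acknowledged below); the snippet assumes its `parse`/`verify` entry points and uses a hypothetical model completion.

```python
from math_verify import parse, verify  # pip install math-verify

# Hypothetical model completion and gold answer for a MATH-style problem.
prediction = r"Comparing the tenths digits, 9.9 > 9.11, so the answer is $\boxed{9.9}$."
gold = r"$9.9$"

# parse() extracts candidate answers; verify() checks mathematical equivalence.
print(verify(parse(gold), parse(prediction)))  # expected: True
```
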
---

# 🌻Acknowledgement

LUFFY builds upon [veRL](https://github.com/volcengine/verl) and [deepscaler](https://github.com/agentica-project/rllm), and uses [vLLM](https://github.com/vllm-project/vllm) for inference. We use [Math-Verify](https://github.com/huggingface/Math-Verify) for math-reasoning evaluation. We thank the open-source community for datasets and backbones, including [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT), [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math), and the [DeepSeek-R1](https://github.com/deepseek-ai/deepseek-r1) model.

Code: https://github.com/ElliottYan/LUFFY

# Citation
If you find our model, data, or evaluation code useful, please cite our paper:
```bibtex
@misc{luffy,
      title={Learning to Reason under Off-Policy Guidance}, 
      author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},
      year={2025},
      eprint={2504.14945},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.14945}, 
}
```