---
license: mit
library_name: transformers
pipeline_tag: text-generation
---
The base Qwen2.5-Math-7B model used by LUFFY, described in [Learning to Reason under Off-Policy Guidance](https://huggingface.co/papers/2504.14945).
We change `rope_theta` from 10000 to 40000 and extend the context window to 16k.
We also modify the `chat_template` for the system prompt and add `<think>`.
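
To give a feel for what the `rope_theta` change does, here is a minimal sketch using the standard rotary position embedding frequency formula, `theta^(-2i/d)`. It compares the longest positional wavelength for the old and new base. The head dimension of 128, the exact value 16384 for "16k", and the `config_overrides` dictionary are assumptions made for illustration; they are not taken from this card or from the LUFFY code.

```python
import math


def rope_inv_freq(theta: float, head_dim: int) -> list[float]:
    """Standard RoPE inverse frequencies: theta^(-2i/d) for each dimension pair i."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(theta: float, head_dim: int) -> float:
    """Wavelength, in token positions, of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(theta, head_dim)[-1]


# Assumption: head dimension is not stated in this card; 128 is used for illustration.
HEAD_DIM = 128

# Hypothetical config fragment mirroring the two changes described above.
# Assumption: "16k" is taken to mean 16384 positions.
config_overrides = {"rope_theta": 40000.0, "max_position_embeddings": 16384}

old_wl = longest_wavelength(10000.0, HEAD_DIM)
new_wl = longest_wavelength(config_overrides["rope_theta"], HEAD_DIM)

print(f"longest wavelength at theta=10000: {old_wl:.0f} positions")
print(f"longest wavelength at theta=40000: {new_wl:.0f} positions")
```

A larger base slows the rotation of the low-frequency dimension pairs, so positions stay distinguishable over a longer span, which is the usual motivation for raising `rope_theta` when extending the context window.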
GitHub: https://github.com/ElliottYan/LUFFY
# Citation
If you find our model, data, or evaluation code useful, please cite our paper:
```bib
@misc{luffy,
title={Learning to Reason under Off-Policy Guidance},
author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},
year={2025},
eprint={2504.14945},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.14945},
}
```