nielsr (HF Staff) committed
Commit 901f023 · verified · 1 Parent(s): b858bf6

Improve model card: Add pipeline tag, library name, link to paper


This PR improves the model card by adding a `pipeline_tag`, `library_name`, and a link to the paper. This will make the model easier to discover and understand on the Hub.

Files changed (1)
  1. README.md +40 -3
README.md CHANGED
@@ -1,3 +1,40 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: question-answering
+ ---
+
+ # Simple Reinforcement Learning for Reasoning
+
+ [![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2503.18892) [![Hugging Face](https://img.shields.io/badge/SimpleRL_Zoo-fcd022?style=for-the-badge&logo=Huggingface&logoColor=000)](https://huggingface.co/collections/hkust-nlp/simplerl-zoo-67e0fd24c185423c1e3452d1)
+
+
+ This repo contains a simple reinforcement learning recipe to improve models' reasoning abilities. It is simple because only a rule-based reward and the GSM8K/MATH datasets are used. We have used this code to successfully train 10 diverse base models with limited data (8K examples), achieving surprisingly strong results: the accuracy gains range from 10 to more than 20 absolute points. These models include Llama3 8B, Mistral 7B/24B, DeepSeekMath 7B, Qwen2.5 0.5B/1.5B/7B/14B/32B, and Qwen2.5-Math-7B. While we observe a significant increase in both response length and accuracy, we note that different models exhibit distinct reasoning behaviors during training, and the increased response length does not necessarily correlate with the emergence of certain cognitive behaviors such as self-verification. We share many findings and practices in our paper, and we release the code, model checkpoints, and analysis tools here.
+
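+ As a minimal sketch of what "rule-based reward" means here (a simplified illustration, not necessarily the exact scoring code in this repo), the reward only checks whether the final answer in a completion matches the reference answer; no learned reward model is involved:
+
+ ```python
+ import re
+
+ def extract_boxed_answer(text: str):
+     # Take the content of the last \boxed{...} in the completion, if any.
+     matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
+     return matches[-1].strip() if matches else None
+
+ def rule_based_reward(completion: str, reference_answer: str) -> float:
+     # Binary rule-based reward: 1.0 if the predicted final answer matches
+     # the reference after light normalization, 0.0 otherwise.
+     predicted = extract_boxed_answer(completion)
+     if predicted is None:
+         return 0.0
+     normalize = lambda s: s.replace(" ", "").lower()
+     return 1.0 if normalize(predicted) == normalize(reference_answer) else 0.0
+
+ # Example: a GSM8K-style check (function names here are illustrative)
+ print(rule_based_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
+ ```
+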
+ > You may find an old version of this repo [here](https://github.com/hkust-nlp/simpleRL-reason/tree/v0), with our early results and codebase using OpenRLHF and PPO.
+
+ <div align="center">
+ <img src="assets/plot_figure1_v2.3_token_length_vs_steps.png" width="700" alt="simplerl-reasoning-intro-figure_00">
+ </div>
+
+ > Accuracy and response length across training iterations for different models. Training starts from base models without any SFT.
+
+ ## News
+ - **[2025/03/24]** We perform successful zero RL training starting from 10 diverse base models. We release all 10 models and the code, and share many findings and practices in our [paper](https://arxiv.org/abs/2503.18892).
+ - **[2025/02/19]** We release checkpoints of [Qwen-2.5-Math-7B-SimpleRL-Zero](https://huggingface.co/hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero) and [Qwen-2.5-Math-7B-SimpleRL](https://huggingface.co/hkust-nlp/Qwen-2.5-Math-7B-SimpleRL) on Hugging Face.
+ - **[2025/01/25]** We release the training/eval code and our blog. We are working on the paper and will release it very soon.
+
+
+ ## Links
+
+ * **Paper: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild**
+   * 📝 [Paper](https://arxiv.org/abs/2503.18892)
+   * 🤗 [Hugging Face Collection](https://huggingface.co/collections/hkust-nlp/simplerl-zoo-67e0fd24c185423c1e3452d1)
+   * 💻 [Github](https://github.com/hkust-nlp/simpleRL-reason/tree/v1)
+
+ * **Blog: 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient**
+   * 📝 [Blog](https://hkust-nlp.notion.site/simplerl-reason)
+   * 🤗 [Hugging Face Collection](https://huggingface.co/collections/hkust-nlp/simplerl-67b543892b2ec6908ffff710)
+   * 💻 [Github](https://github.com/hkust-nlp/simpleRL-reason/tree/v0)
+
+ **(The rest of the README content remains unchanged)**