nielsr (HF Staff) committed
Commit a18c3a1 · verified · 1 Parent(s): d630142

Improve model card: add pipeline tag, library name, and additional information


This PR improves the model card by adding the `question-answering` pipeline tag for better discoverability and the `transformers` library name, and by incorporating relevant information from the GitHub README. It also ensures the paper link is present.
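For context, here is a minimal usage sketch of what the new `library_name: transformers` metadata implies. The repo id below is a placeholder (pick an actual checkpoint from the SimpleRL-Zoo collection linked in the card), and it assumes the checkpoints load as standard causal LMs:

```python
# Hypothetical sketch -- the repo id is a placeholder, not a real checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hkust-nlp/<simplerl-zoo-checkpoint>"  # replace with a checkpoint from the collection

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Question: A train travels 60 km in 45 minutes. What is its average speed in km/h?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Print only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```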

Files changed (1)
  1. README.md +40 -3
README.md CHANGED
@@ -1,3 +1,40 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: question-answering
+ library_name: transformers
+ ---
+
+ # Simple Reinforcement Learning for Reasoning
+
+ [![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2503.18892) [![Hugging Face](https://img.shields.io/badge/SimpleRL_Zoo-fcd022?style=for-the-badge&logo=Huggingface&logoColor=000)](https://huggingface.co/collections/hkust-nlp/simplerl-zoo-67e0fd24c185423c1e3452d1)
+
+
+ This repo contains a simple reinforcement learning recipe to improve models' reasoning abilities. It is simple because only a rule-based reward and the GSM8K/MATH datasets are used. We have used this code to successfully train 10 diverse base models with limited data (8K examples), achieving surprisingly strong results -- the accuracy gains range from 10 to more than 20 absolute points. These models include Llama3 8B, Mistral 7B/24B, DeepSeekMath 7B, Qwen2.5 0.5B/1.5B/7B/14B/32B, and Qwen2.5-Math-7B. While we observe a significant increase in both response length and accuracy, we note that different models exhibit distinct reasoning behaviors during training, and the increased response length does not necessarily correlate with the emergence of certain cognitive behaviors such as self-verification. We share many findings and practices in our paper, and we release the code, model checkpoints, and analysis tools here.
+
+ > You may find an old version of this repo [here](https://github.com/hkust-nlp/simpleRL-reason/tree/v0), with our early results and codebase using OpenRLHF and PPO.
+
+ <div align="center">
+ <img src="assets/plot_figure1_v2.3_token_length_vs_steps.png" width="700" alt="simplelr-reaoning-intro-figure_00">
+ </div>
+
+ > Accuracy and response length across training iterations for different models. Training starts from base models without any SFT.
+
+
+ ## Main Results
+
+ **(Tables from the GitHub README here)**
+
+ ## Model Checkpoints
+
+ **(Model Checkpoint table from the GitHub README here)**
+
+ ## Quick Start
+
+ **(Instructions from the GitHub README here)**
+
+ ## Citation
+
+ **(Citation information from the GitHub README here)**
+
+ ## Acknowledgement
+ **(Acknowledgement information from the GitHub README here)**
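As a side note (not part of the diff above): the "rule-based reward" mentioned in the new README is, in spirit, an exact-match check on the model's final answer. Below is a minimal sketch of that idea, assuming a `\boxed{...}` final-answer convention; the actual reward and answer normalization used in simpleRL-reason may differ.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy rule-based reward: 1.0 if the final boxed answer matches the gold answer, else 0.0.

    Assumes the model is prompted to put its final answer in \\boxed{...};
    the actual reward used by simpleRL-reason may normalize answers differently.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0
    predicted = matches[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

# Example usage
print(rule_based_reward(r"... so the total is \boxed{72}.", "72"))  # 1.0
print(rule_based_reward("The answer is 72.", "72"))                 # 0.0 (no boxed answer)
```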