Improve model card: add pipeline tag, library name, and additional information
This PR improves the model card by adding the `question-answering` pipeline tag for better discoverability, the `transformers` library name, and relevant information from the GitHub README. It also ensures the paper link is present.
README.md CHANGED

```diff
@@ -1,3 +1,40 @@
----
-license: apache-2.0
----
+---
+license: apache-2.0
+pipeline_tag: question-answering
+library_name: transformers
+---
+
+# Simple Reinforcement Learning for Reasoning
+
+[Paper](https://arxiv.org/abs/2503.18892) [Collection](https://huggingface.co/collections/hkust-nlp/simplerl-zoo-67e0fd24c185423c1e3452d1)
+
+
+This repo contains a simple reinforcement learning recipe to improve models' reasoning abilities. It is simple because only a rule-based reward and the GSM8K/MATH datasets are used. We have used this code to successfully train 10 diverse base models with limited data (8K examples), achieving surprisingly strong results: accuracy gains range from 10 to more than 20 absolute points. These models include Llama3 8B, Mistral 7B/24B, DeepSeekMath 7B, Qwen2.5 0.5B/1.5B/7B/14B/32B, and Qwen2.5-Math-7B. While we observe a significant increase in both response length and accuracy, we note that different models exhibit distinct reasoning behaviors during training, and the increased response length does not necessarily correlate with the emergence of certain cognitive behaviors such as self-verification. We share many findings and practices in our paper, and we release the code, model checkpoints, and analysis tools here.
+
+> You may find an old version of this repo [here](https://github.com/hkust-nlp/simpleRL-reason/tree/v0), with our early results and codebase using OpenRLHF and PPO.
+
+<div align="center">
+<img src="assets/plot_figure1_v2.3_token_length_vs_steps.png" width="700" alt="simplelr-reaoning-intro-figure_00">
+</div>
+
+> Accuracy and response length across training iterations for different models. Training starts from base models without any SFT.
+
+
+## Main Results
+
+**(Tables from the GitHub README here)**
+
+## Model Checkpoints
+
+**(Model Checkpoint table from the GitHub README here)**
+
+## Quick Start
+
+**(Instructions from the GitHub README here)**
+
+## Citation
+
+**(Citation information from the GitHub README here)**
+
+## Acknowledgement
+**(Acknowledgement information from the GitHub README here)**
```
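The "rule-based reward" the model card refers to can be sketched as a binary exact-match check on the final numeric answer. This is an illustrative assumption only: the function name `rule_based_reward` and the extraction logic below are hypothetical and may differ from the actual reward used in the simpleRL-reason repo.

```python
import re


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical sketch of a binary rule-based reward:
    1.0 if the last number in the model's response matches the
    gold answer, else 0.0. No learned reward model is involved."""
    # Extract all number-like tokens, ignoring thousands separators.
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not matches:
        return 0.0
    # Compare the final extracted number against the reference answer.
    return 1.0 if matches[-1] == gold_answer.strip() else 0.0


print(rule_based_reward("... so the answer is 42.", "42"))  # 1.0
print(rule_based_reward("I am not sure.", "42"))            # 0.0
```

A reward this sparse works here because GSM8K/MATH problems have a single verifiable final answer, so no reward model needs to be trained.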