lewtun HF Staff committed on
Commit b382da0 · verified · 1 Parent(s): 9e1639f

Update README.md

Files changed (1)
  1. README.md +91 -63
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 license: apache-2.0
 datasets:
- - open-r1/reasoning-mix
 language:
 - en
 base_model:
@@ -11,13 +11,15 @@ library_name: transformers
 
 <img src="open-r1-thumbnail.png" alt="Centered Image" style="display: block; margin: 0 auto;" width="300">
 
- # OpenR1-Distill-7B
 
- OpenR1-Distill-7B is a post-trained version of [Qwen/Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) on [Mixture-of-Reasons](https://huggingface.co/datasets/open-r1/Mixture-of-Reasons); a curated dataset of 350k reasoning traces distilled from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) in the domains of mathematics, coding, and science. This model matches or exceeds the performance of [DeepSeek's 7B distilled model](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
 
 ## Model description
 
- - **Model type:** A 7B parameter GPT-like model fine-tuned on a mix of publicly available, synthetic datasets.
 - **Language(s) (NLP):** Primarily English
 - **License:** Apache 2.0
 - **Finetuned from model:** a [variant](https://huggingface.co/open-r1/Qwen2.5-Math-7B-RoPE-300k) of [Qwen/Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B), whose RoPE base frequency was extended to 300k to enable training on a context of 32k tokens.
@@ -26,99 +28,125 @@ OpenR1-Distill-7B is a post-trained version of [Qwen/Qwen2.5-Math-7B](https://hu
 
 <!-- Provide the basic links for the model. -->
 
- - **Repository:** https://github.com/huggingface/alignment-handbook
- - **Demo:** https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat
- - **Chatbot Arena:** Evaluate Zephyr 7B against 10+ LLMs in the LMSYS arena: http://arena.lmsys.org
-
- ## Performance
-
- At the time of release, Zephyr-7B-β is the highest ranked 7B chat model on the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks:
-
- | Model | AIME 2024 | MATH-500 | GPQA-D | LiveCodeBench |
- | :---- | :----: | :----: | :----: | :----: |
- | OpenR1-Distill-7B | 52.66 | 89 | 52.78 | X |
- | DeepSeek-R1-Distill-Qwen-7B | 51.25 | 93.45 | 52.4 | 37.41 |
 
- In particular, on several categories of MT-Bench, Zephyr-7B-β has strong performance compared to larger open models like Llama2-Chat-70B:
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6200d0a443eb0913fa2df7cc/raxvt5ma16d7T23my34WC.png)
 
- However, on more complex tasks like coding and mathematics, Zephyr-7B-β lags behind proprietary models and more research is needed to close the gap.
-
-
- ## Intended uses & limitations
 
- The model was initially fine-tuned on a filtered and preprocessed of the [`UltraChat`](https://huggingface.co/datasets/stingning/ultrachat) dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT.
- We then further aligned the model with [🤗 TRL's](https://github.com/huggingface/trl) `DPOTrainer` on the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, which contains 64k prompts and model completions that are ranked by GPT-4. As a result, the model can be used for chat and you can check out our [demo](https://huggingface.co/spaces/HuggingFaceH4/zephyr-chat) to test its capabilities.
 
- You can find the datasets used for training Zephyr-7B-β [here](https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-6538c6d6d5ddd1cbb1744a66)
 
- Here's how you can run the model using the `pipeline()` function from 🤗 Transformers:
 
 ```python
- # Install transformers from source - only needed for versions <= v4.34
- # pip install git+https://github.com/huggingface/transformers.git
- # pip install accelerate
-
 import torch
 from transformers import pipeline
 
- pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")
 
- # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
 messages = [
-     {
-         "role": "system",
-         "content": "You are a friendly chatbot who always responds in the style of a pirate",
-     },
-     {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]
- prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
 print(outputs[0]["generated_text"])
- # <|system|>
- # You are a friendly chatbot who always responds in the style of a pirate.</s>
- # <|user|>
- # How many helicopters can a human eat in one sitting?</s>
- # <|assistant|>
- # Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!
 ```
 
- ## Bias, Risks, and Limitations
 
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
- Zephyr-7B-β has not been aligned to human preferences for safety within the RLHF phase or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so).
- It is also unknown what the size and composition of the corpus was used to train the base model (`mistralai/Mistral-7B-v0.1`), however it is likely to have included a mix of Web data and technical sources like books and code. See the [Falcon 180B model card](https://huggingface.co/tiiuae/falcon-180B#training-data) for an example of this.
 
- ## Training and evaluation data
 
  ### Training hyperparameters
 
 The following hyperparameters were used during training:
- - learning_rate: 5e-07
 - train_batch_size: 2
- - eval_batch_size: 4
 - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 16
- - total_train_batch_size: 32
- - total_eval_batch_size: 64
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 3.0
 
 ### Training results
 
 ### Framework versions
 
- - Transformers 4.35.0.dev0
- - Pytorch 2.0.1+cu118
- - Datasets 2.12.0
- - Tokenizers 0.14.0
 
 ## Citation
 
 ---
 license: apache-2.0
 datasets:
+ - open-r1/Mixture-of-Thoughts
 language:
 - en
 base_model:
 
 <img src="open-r1-thumbnail.png" alt="Centered Image" style="display: block; margin: 0 auto;" width="300">
 
+ # Model summary
 
+ OpenR1-Distill-7B is a post-trained version of [Qwen/Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) on [Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts): a curated dataset of 350k verified reasoning traces distilled from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1). The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step by step.
+
+ OpenR1-Distill-7B replicates the reasoning capabilities of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) while remaining fully open and reproducible. It is ideal for research on inference-time compute and reinforcement learning with verifiable rewards (RLVR).
 
 ## Model description
 
+ - **Model type:** A 7B parameter GPT-like model, post-trained on a mix of publicly available, synthetic datasets.
 - **Language(s) (NLP):** Primarily English
 - **License:** Apache 2.0
 - **Finetuned from model:** a [variant](https://huggingface.co/open-r1/Qwen2.5-Math-7B-RoPE-300k) of [Qwen/Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B), whose RoPE base frequency was extended to 300k to enable training on a context of 32k tokens (see the config check below).
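A quick way to confirm the extended RoPE base frequency and context window is to inspect the checkpoint's config. This is a minimal sketch; the printed values are expectations based on the description above, assuming a Qwen2-style config that exposes the RoPE base as `rope_theta`:

```python
from transformers import AutoConfig

# Load only the config (no weights). For Qwen2-style models the RoPE base
# frequency is exposed as `rope_theta` and the training context length as
# `max_position_embeddings` (expected here: ~300k and 32768 respectively).
config = AutoConfig.from_pretrained("open-r1/OpenR1-Distill-7B")
print(config.rope_theta)
print(config.max_position_embeddings)
```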
 
 <!-- Provide the basic links for the model. -->
 
+ - **Repository:** https://github.com/huggingface/open-r1
+ - **Training logs:** https://wandb.ai/huggingface/open-r1/runs/199cum6l
 
+ ## Usage
 
+ To chat with the model, first install 🤗 Transformers:
 
+ ```shell
+ pip install "transformers>=4.52.0"
+ ```
 
+ Then run the chat CLI as follows:
 
+ ```shell
+ transformers chat open-r1/OpenR1-Distill-7B \
+     max_new_tokens=2048 \
+     do_sample=True \
+     temperature=0.6 \
+     top_p=0.95
+ ```
 
+ Alternatively, run the model using the `pipeline()` function:
 
 ```python
 import torch
 from transformers import pipeline
 
+ pipe = pipeline("text-generation", model="open-r1/OpenR1-Distill-7B", torch_dtype=torch.bfloat16, device_map="auto")
 
 messages = [
+     {"role": "user", "content": "Which number is larger, 9.9 or 9.11?"},
 ]
+ outputs = pipe(messages, max_new_tokens=2048, do_sample=True, temperature=0.6, top_p=0.95, return_full_text=False)
 print(outputs[0]["generated_text"])
 ```
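The model generates a long chain of thought before its final answer. Assuming the chat template wraps the reasoning in `<think>...</think>` tags, as in DeepSeek-R1-style traces, a minimal post-processing sketch for separating the reasoning from the answer could look like this:

```python
# Hypothetical post-processing: assumes a DeepSeek-R1-style "</think>" marker
# separates the reasoning trace from the final answer in the completion.
text = outputs[0]["generated_text"]
if "</think>" in text:
    reasoning, answer = text.split("</think>", maxsplit=1)
else:
    reasoning, answer = "", text  # fall back if no marker is emitted
print(answer.strip())
```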
 
+ ## Performance
+
+ We use [Lighteval](https://github.com/huggingface/lighteval) to evaluate models on the following benchmarks:
+
+ | Model                       | AIME 2024 | MATH-500 | GPQA-D | LiveCodeBench |
+ |-----------------------------|-----------|----------|--------|---------------|
+ | OpenR1-Distill-7B           | 52.7      | 89.0     | 52.8   | 39.4          |
+ | DeepSeek-R1-Distill-Qwen-7B | 51.3      | 93.5     | 52.4   | 37.4          |
+
+ All scores denote pass@1 accuracy and use sampling with `temperature=0.6` and `top_p=0.95`. The DeepSeek-R1 tech report estimates pass@1 by sampling between 4 and 64 responses per query, but does not specify the exact number used for each benchmark. In the table above, we estimate pass@1 accuracy with the following number of responses per query:
+
+ | Benchmark     | Number of responses per query |
+ |:-------------:|:-----------------------------:|
+ | AIME 2024     | 64                            |
+ | MATH-500      | 4                             |
+ | GPQA Diamond  | 8                             |
+ | LiveCodeBench | 16                            |
 
+ Note that for benchmarks like AIME 2024, it is important to sample many responses per query: the benchmark contains only 30 problems, so small sample sizes introduce high variance across repeated runs. The choice of how many responses to sample per prompt likely explains the small differences between our evaluation results and those reported by DeepSeek. Check out the [`open-r1` repo](https://github.com/huggingface/open-r1?tab=readme-ov-file#evaluating-models) for instructions on how to reproduce these results.
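As an illustration of this estimator (not the Lighteval implementation), pass@1 with n sampled responses per query is simply the fraction of correct samples for each query, averaged over all queries:

```python
from statistics import mean

def pass_at_1(per_query_correctness: list[list[bool]]) -> float:
    """Estimate pass@1 from n sampled responses per query.

    per_query_correctness[i][j] is True if the j-th sampled response to
    query i is judged correct by the benchmark's verifier.
    """
    per_query = [mean(1.0 if ok else 0.0 for ok in samples) for samples in per_query_correctness]
    return mean(per_query)

# Example: 2 queries with 4 samples each -> (3/4 + 1/4) / 2 = 0.5
print(pass_at_1([[True, True, True, False], [False, True, False, False]]))
```

Sampling more responses per query does not change the expected value of this estimate, but it substantially reduces its variance on small benchmarks such as AIME 2024.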
 
 
+ ## Training methodology
 
+ OpenR1-Distill-7B was trained using supervised fine-tuning (SFT) on the [Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) dataset, which contains reasoning traces distilled from [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1). To optimise the data mixture, we followed the same methodology described in the [Phi-4-reasoning tech report](https://huggingface.co/papers/2504.21318), namely that mixtures can be optimised independently per domain and then combined into a single dataset, as sketched after the list below. The figure below shows the evolution of our experiments on the math and code domains:
+
+ <img src="data_mixture.png" alt="Centered Image" style="display: block; margin: 0 auto;">
+
+ The individual experiments correspond to the following:
+
+ * exp1 - exp3: extending the model's base RoPE frequency from 10k to 100k, 200k, and 300k respectively.
+ * exp4 - exp6: scaling the learning rate on the math and code mixtures from 1e-5 to 2e-5 and 4e-5, respectively.
+ * exp7 - exp8: measuring the impact of sequence packing (exp7) versus no packing (exp8) on the math mixture.
+ * exp9 - exp10: measuring the impact of training on all three mixtures (math, code, and science) versus training on math and code only.
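The combination step itself is simple. Below is a minimal sketch using 🤗 Datasets; it assumes Mixture-of-Thoughts exposes per-domain subsets named `math`, `code`, and `science` (check the dataset card for the exact configuration names):

```python
from datasets import concatenate_datasets, load_dataset

# Load each per-domain subset (configuration names are assumptions; see the dataset card).
domains = ["math", "code", "science"]
subsets = [load_dataset("open-r1/Mixture-of-Thoughts", name, split="train") for name in domains]

# Combine the independently optimised mixtures into a single SFT dataset.
mixture = concatenate_datasets(subsets).shuffle(seed=42)
print(mixture)
```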
 
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
+
+ - num_epochs: 5.0
+ - learning_rate: 4.0e-05
+ - num_devices: 8
 - train_batch_size: 2
+ - gradient_accumulation_steps: 8
+ - total_train_batch_size: 2 * 8 * 8 = 128
 - seed: 42
+ - distributed_type: DeepSpeed ZeRO-3
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine_with_min_lr with min_lr_rate=0.1
+ - lr_scheduler_warmup_ratio: 0.03
+ - max_grad_norm: 0.2
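For reference, these settings roughly map onto a TRL `SFTConfig` as sketched below. This is an illustrative mapping under the framework versions listed further down, not the exact training configuration used (see the [`open-r1` repo](https://github.com/huggingface/open-r1) for the real recipes); the DeepSpeed ZeRO-3 and multi-device setup are configured separately via 🤗 Accelerate:

```python
from trl import SFTConfig

# Illustrative sketch only: mirrors the hyperparameters listed above.
training_args = SFTConfig(
    output_dir="OpenR1-Distill-7B",
    num_train_epochs=5.0,
    learning_rate=4.0e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # 2 per device * 8 steps * 8 devices = 128 effective batch size
    seed=42,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    warmup_ratio=0.03,
    max_grad_norm=0.2,
    bf16=True,
)
# trainer = SFTTrainer(model=..., args=training_args, train_dataset=...)  # sketch
```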
 
 ### Training results
 
+ During training, we monitor progress on AIME 2024, GPQA Diamond, and LiveCodeBench v4 every epoch. We use LiveCodeBench v4 to accelerate evaluation, as it contains fewer problems than v5 yet remains representative of the full benchmark. The following plot shows the training results:
+
+ <img src="train_results.png" alt="Centered Image" style="display: block; margin: 0 auto;">
 
 ### Framework versions
 
+ - Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.31
+ - Python version: 3.11.11
+ - TRL version: 0.18.0.dev0
+ - PyTorch version: 2.6.0
+ - Transformers version: 4.52.0.dev0
+ - Accelerate version: 1.4.0
+ - Datasets version: 3.5.1
+ - HF Hub version: 0.30.2
+ - bitsandbytes version: 0.45.5
+ - DeepSpeed version: 0.16.8
+ - Liger-Kernel version: 0.5.9
+ - OpenAI version: 1.76.2
+ - vLLM version: 0.8.4
 
 ## Citation
+
+ If you find this model useful in your own work, please consider citing it as follows:
+
+ ```bibtex
+ @misc{openr1,
+     title = {Open R1: A fully open reproduction of DeepSeek-R1},
+     url = {https://github.com/huggingface/open-r1},
+     author = {Hugging Face},
+     month = {January},
+     year = {2025}
+ }
+ ```