Polaris is an open-source post-training method that uses reinforcement learning (RL) scaling to refine and enhance models with advanced reasoning abilities. Our research shows that even top-tier models like Qwen3-4B can achieve significant improvements on challenging reasoning tasks when optimized with Polaris.

By leveraging open-source data and academic-level resources, Polaris pushes the capabilities of open-recipe reasoning models to unprecedented heights. In benchmark tests, our method even surpasses top commercial systems, including Claude-4-Opus, Grok-3-Beta, and o3-mini-high (2025/01/03).

<div align="center">
<img src="https://raw.githubusercontent.com/ChenxinAn-fdu/POLARIS/main/figs/aime25.png" alt="performance" style="width:60%;">
</div>

## Polaris's Recipe
- **Data Difficulty:** Before training, Polaris analyzes and maps the distribution of data difficulty. The dataset should not be overwhelmed by either overly difficult or trivially easy problems. We recommend using a data distribution with a slight bias toward challenging problems, which typically exhibits a mirrored J-shaped distribution (a filtering sketch follows this list).
- **Diversity-Based Rollout:** We leverage the *diversity among rollouts* to initialize the sampling temperature, which is then progressively increased throughout the RL training stages (a temperature-initialization sketch follows this list).
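
To make the **Data Difficulty** point concrete, here is a minimal Python sketch of pre-training difficulty filtering. It assumes difficulty is measured as the base model's pass rate over a few sampled rollouts; the function names, thresholds, and data layout are illustrative assumptions, not the official POLARIS implementation.

```python
# A minimal sketch of difficulty filtering, assuming `model` is a callable
# that returns one sampled final answer per call. Cutoffs are illustrative.
def estimate_pass_rate(model, question, reference, n_rollouts=8):
    """Fraction of sampled rollouts whose final answer matches the reference."""
    hits = sum(model(question) == reference for _ in range(n_rollouts))
    return hits / n_rollouts

def filter_by_difficulty(dataset, model, easy_cut=0.9, hard_cut=0.0):
    """Drop problems the model always solves (trivial) or never solves
    (currently unlearnable). What remains skews toward low pass rates,
    i.e. the mirrored J-shaped difficulty distribution described above."""
    kept = []
    for ex in dataset:  # each ex: {"question": ..., "answer": ...}
        p = estimate_pass_rate(model, ex["question"], ex["answer"])
        if hard_cut < p < easy_cut:  # keep only informative problems
            kept.append({**ex, "pass_rate": p})
    return kept
```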
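For the **Diversity-Based Rollout** point, the sketch below shows one way the idea could work: measure rollout diversity at several candidate temperatures and pick the lowest one that reaches a target diversity, then raise the temperature at later RL stages. The diversity metric (distinct final answers among rollouts), the candidate grid, and the target value are assumptions for illustration only.

```python
# A hedged sketch of diversity-based temperature initialization; the metric
# and target below are assumptions, not the paper's exact procedure.
def rollout_diversity(model, question, temperature, n_rollouts=16):
    """Diversity = fraction of distinct final answers among the rollouts."""
    answers = [model(question, temperature=temperature) for _ in range(n_rollouts)]
    return len(set(answers)) / n_rollouts

def init_temperature(model, probe_questions,
                     candidates=(0.6, 0.8, 1.0, 1.2, 1.4),
                     target_diversity=0.6):
    """Pick the lowest candidate temperature whose average rollout
    diversity over a small probe set reaches the target."""
    for t in candidates:  # candidates assumed sorted ascending
        avg = sum(rollout_diversity(model, q, t)
                  for q in probe_questions) / len(probe_questions)
        if avg >= target_diversity:
            return t
    return candidates[-1]

# The chosen value serves as the stage-1 temperature; later RL stages then
# increase it progressively (e.g. t_stage = t0 + 0.1 * stage_index).
```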