Update README.md
README.md

- Qwen/Qwen2.5-Math-7B-Instruct
pipeline_tag: text-classification
library_name: transformers
datasets:
- declare-lab/PathFinder-600K
---

# PathFinder-PRM-7B

### Results

#### PRMBench Results

| Model | Simplicity | Soundness | Sensitivity | Overall |
|----------------------------------|------------|-----------|-------------|---------|
| **LLM-as-judge, Proprietary Language Models** | | | | |
| Gemini-2.0-thinking-exp-1219 | 66.2 | 71.8 | 75.3 | 68.8 |
| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
| **LLM-as-judge, Open-source Language Models** | | | | |
| Qwen-2.5-Math-72B | 55.1 | 61.1 | 67.1 | 57.4 |
| QwQ-Preview-32B | 56.4 | 68.2 | 73.5 | 63.6 |
| **Discriminative Process Reward Models** | | | | |
| Math-Shepherd-7B | 47.1 | 45.7 | 60.7 | 47.0 |
| Math-PSA-7B | 51.3 | 51.8 | 64.9 | 52.3 |
| RLHFlow-Mistral-8B | 46.7 | 57.5 | 68.5 | 54.4 |
| Llemma-PRM800K-7B | 51.4 | 50.9 | 66.0 | 52.0 |
| ReasonEval-7B | 55.5 | 63.9 | 71.0 | 60.0 |
| Qwen2.5-Math-PRM-7B | 52.1 | **71.0** | 75.5 | 65.5 |
| 🟢 PathFinder-PRM-7B | **58.9** | 70.8 | **76.9** | **67.7** |

Note: Simplicity, Soundness, and Sensitivity are averaged sub-metrics from PRMBench. Our model, PathFinder-PRM-7B, achieves the highest Overall score (67.7) among all listed discriminative PRMs and open-source LLM-as-judge models, and is competitive with the proprietary models (68.8 for Gemini-2.0-thinking-exp-1219, 66.8 for GPT-4o).
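
The Overall-score comparison above can be checked directly against the table. This minimal sketch hard-codes only the Overall column and picks the best model within a chosen set of groups; `best_overall` and the group labels are illustrative names, not part of PRMBench or any released evaluation code.

```python
# PRMBench Overall scores, copied from the table above and grouped as in the table.
PRMBENCH_OVERALL = {
    "LLM-as-judge, proprietary": {
        "Gemini-2.0-thinking-exp-1219": 68.8,
        "GPT-4o": 66.8,
    },
    "LLM-as-judge, open-source": {
        "Qwen-2.5-Math-72B": 57.4,
        "QwQ-Preview-32B": 63.6,
    },
    "Discriminative PRM": {
        "Math-Shepherd-7B": 47.0,
        "Math-PSA-7B": 52.3,
        "RLHFlow-Mistral-8B": 54.4,
        "Llemma-PRM800K-7B": 52.0,
        "ReasonEval-7B": 60.0,
        "Qwen2.5-Math-PRM-7B": 65.5,
        "PathFinder-PRM-7B": 67.7,
    },
}


def best_overall(groups):
    """Return the (model, score) pair with the highest Overall score across `groups`."""
    rows = [(model, score) for g in groups for model, score in PRMBENCH_OVERALL[g].items()]
    return max(rows, key=lambda row: row[1])


open_groups = ["LLM-as-judge, open-source", "Discriminative PRM"]
print(best_overall(open_groups))       # ('PathFinder-PRM-7B', 67.7)
print(best_overall(PRMBENCH_OVERALL))  # ('Gemini-2.0-thinking-exp-1219', 68.8)
```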

#### ProcessBench Results

| Model | # Training Samples | GSM8K | MATH | Olympiad | OmniMath | Avg. F1 |
|-------------------------------|-----------|-------|-------|----------|----------|---------|
| Math-Shepherd-7B | 445K | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
| RLHFlow-Mistral-8B | 273K | 50.4 | 33.4 | 13.8 | 15.8 | 28.4 |
| Llemma-PRM800K-7B | ~350K | 48.4 | 43.1 | 28.5 | 33.4 | 38.4 |
| Qwen2.5-Math-7B-PRM800K | 264K | 68.2 | 62.6 | 50.7 | 44.3 | 56.5 |
| 🟢 PathFinder-PRM-7B | ~400K | 77.9 | 75.3 | 65.0 | 59.7 | 69.5 |
| Qwen2.5-Math-PRM-7B | ~1.5M | 82.4 | 77.6 | 67.5 | 66.3 | 73.5 |

On ProcessBench, PathFinder-PRM-7B outperforms the models trained on similar amounts of data, but scores 4.0 points lower in average F1 than Qwen2.5-Math-PRM-7B, which was trained on roughly 3.75x as much data (~1.5M vs. ~400K samples).
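
As a quick sanity check on these numbers, the sketch below recomputes Avg. F1 as the unweighted mean of the four per-subset scores (an assumption, but one that reproduces the reported averages), then the gap to Qwen2.5-Math-PRM-7B and the ratio of the approximate training-set sizes. `avg_f1` is a hypothetical helper; `Decimal` with half-up rounding is used so that exact halves such as 28.35 round the way the table does.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-subset ProcessBench F1 scores from the table above, in column order:
# (GSM8K, MATH, Olympiad, OmniMath). Strings keep the decimals exact.
PROCESSBENCH_F1 = {
    "Math-Shepherd-7B": ("47.9", "29.5", "24.8", "23.8"),
    "RLHFlow-Mistral-8B": ("50.4", "33.4", "13.8", "15.8"),
    "Llemma-PRM800K-7B": ("48.4", "43.1", "28.5", "33.4"),
    "Qwen2.5-Math-7B-PRM800K": ("68.2", "62.6", "50.7", "44.3"),
    "PathFinder-PRM-7B": ("77.9", "75.3", "65.0", "59.7"),
    "Qwen2.5-Math-PRM-7B": ("82.4", "77.6", "67.5", "66.3"),
}


def avg_f1(scores):
    """Unweighted mean of the subset F1 scores, rounded half-up to one decimal."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for model, scores in PROCESSBENCH_F1.items():
    print(f"{model:<26} {avg_f1(scores)}")

# Gap between Qwen2.5-Math-PRM-7B and PathFinder-PRM-7B, in F1 points.
gap = avg_f1(PROCESSBENCH_F1["Qwen2.5-Math-PRM-7B"]) - avg_f1(PROCESSBENCH_F1["PathFinder-PRM-7B"])
# Ratio of the approximate training-set sizes (~1.5M vs. ~400K samples).
data_ratio = 1_500_000 / 400_000
print(gap, data_ratio)  # 4.0 3.75
```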

#### Reward-Guided Greedy Search (PRM@8)

| Model | AIME24 | AMC23 | MATH | Olympiad | College | Minerva | Avg |
|------------------------------|--------|-------|-------|----------|---------|---------|-------|
| Math-Shepherd-7B | 13.3 | 52.5 | 74.6 | 38.5 | 36.5 | 41.2 | 42.8 |
| Math-PSA-7B | 6.7 | 57.5 | 79.8 | 42.5 | 41.0 | 39.3 | 44.5 |
| Skywork-PRM-7B | 10.0 | 57.5 | 77.8 | 41.5 | 39.0 | **43.4** | 44.9 |
| Qwen2.5-Math-PRM-7B | 16.7 | 60.0 | **81.0** | **43.5** | 39.0 | 40.4 | 46.8 |
| 🟢 PathFinder-PRM-7B | **20.0** | **62.5** | 78.8 | 36.5 | **55.0** | 36.7 | **48.3** |

Note: All results are computed using reward-guided greedy search with Qwen2.5-7B-Instruct as the policy model. PathFinder-PRM-7B achieves the highest average accuracy (48.3) among the open-source discriminative PRMs in this setting, showing its ability to better guide policy models toward correct solutions.
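
The loop below is a minimal sketch of reward-guided greedy search, assuming that "PRM@8" means the policy proposes 8 candidate next steps at each iteration and the PRM keeps the highest-scoring one. It is not the authors' evaluation code: `propose`, `score`, `toy_propose`, `toy_score`, and the `"Answer:"` stopping convention are all hypothetical. In a real run, `propose` would sample steps from Qwen2.5-7B-Instruct and `score` would query PathFinder-PRM-7B.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not a published PathFinder-PRM API):
#   propose(prefix, n) -> n candidate next steps sampled from the policy model
#   score(prefix, step) -> PRM reward for appending `step` to the partial solution
ProposeFn = Callable[[List[str], int], List[str]]
ScoreFn = Callable[[List[str], str], float]


def reward_guided_greedy_search(
    propose: ProposeFn,
    score: ScoreFn,
    num_candidates: int = 8,  # assumed reading of "PRM@8": 8 candidates per step
    max_steps: int = 10,
    is_final: Callable[[str], bool] = lambda step: step.startswith("Answer:"),
) -> List[str]:
    """Grow a solution step by step, always keeping the candidate the PRM scores highest."""
    solution: List[str] = []
    for _ in range(max_steps):
        candidates = propose(solution, num_candidates)
        best = max(candidates, key=lambda step: score(solution, step))
        solution.append(best)
        if is_final(best):
            break
    return solution


# Toy stand-ins so the sketch runs end to end (both hypothetical).
def toy_propose(prefix: List[str], n: int) -> List[str]:
    # After two reasoning steps, the toy policy proposes final answers.
    if len(prefix) < 2:
        return [f"step {len(prefix)}: idea {i}" for i in range(n)]
    return [f"Answer: {i}" for i in range(n)]


def toy_score(prefix: List[str], step: str) -> float:
    # The toy PRM prefers "idea 3" steps and the final answer 5.
    if step.startswith("Answer:"):
        return 1.0 if step == "Answer: 5" else 0.0
    return 1.0 if step.endswith("idea 3") else 0.1


print(reward_guided_greedy_search(toy_propose, toy_score))
# ['step 0: idea 3', 'step 1: idea 3', 'Answer: 5']
```

Greedy selection commits to one candidate per step and never backtracks, so every intermediate choice rests on the PRM's step-level ranking; beam search would keep several partial solutions alive at a higher compute cost.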

## Citation

primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.12345},
}
```