Update README.md
README.md CHANGED
@@ -6,6 +6,8 @@ base_model:
 - Qwen/Qwen2.5-Math-7B-Instruct
 pipeline_tag: text-classification
 library_name: transformers
+datasets:
+- declare-lab/PathFinder-600K
 ---
 
 # PathFinder-PRM-7B
@@ -181,9 +183,54 @@ reward_score = run_inference(messages)
 
 ### Results
 
-[
 
-
+
+
+#### PRMBench Results
+
+| Model | Simplicity | Soundness | Sensitivity | Overall |
+|----------------------------------|------------|-----------|-------------|---------|
+| **LLM-as-judge, Proprietary Language Models** | | | | |
+| Gemini-2.0-thinking-exp-1219 | 66.2 | 71.8 | 75.3 | 68.8 |
+| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
+| **LLM-as-judge, Open-source Language Models** | | | | |
+| Qwen-2.5-Math-72B | 55.1 | 61.1 | 67.1 | 57.4 |
+| QwQ-Preview-32B | 56.4 | 68.2 | 73.5 | 63.6 |
+| **Discriminative Process Reward Models** | | | | |
+| Math-Shepherd-7B | 47.1 | 45.7 | 60.7 | 47.0 |
+| Math-PSA-7B | 51.3 | 51.8 | 64.9 | 52.3 |
+| RLHFlow-Mistral-8B | 46.7 | 57.5 | 68.5 | 54.4 |
+| Llemma-PRM800K-7B | 51.4 | 50.9 | 66.0 | 52.0 |
+| ReasonEval-7B | 55.5 | 63.9 | 71.0 | 60.0 |
+| Qwen2.5-Math-PRM-7B | 52.1 | **71.0** | 75.5 | 65.5 |
+| 🟢 PathFinder-PRM-7B | **58.9** | 70.8 | **76.9** | **67.7** |
+
+Note: Simplicity, Soundness, and Sensitivity are averaged sub-metrics from PRMBench. PathFinder-PRM-7B outperforms all open-source discriminative PRMs and LLM-as-judge models, while remaining competitive with large proprietary models.
+
+
+#### ProcessBench Results
+
+| Model | # Samples | GSM8K | MATH | Olympiad | OmniMath | Avg. F1 |
+|-------------------------------|-----------|-------|-------|----------|----------|---------|
+| Math-Shepherd-7B | 445K | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
+| RLHFlow-Mistral-8B | 273K | 50.4 | 33.4 | 13.8 | 15.8 | 28.4 |
+| Llemma-PRM800K-7B | ~350K | 48.4 | 43.1 | 28.5 | 33.4 | 38.4 |
+| Qwen2.5-Math-7B-PRM800K | 264K | 68.2 | 62.6 | 50.7 | 44.3 | 58.5 |
+| 🟢 PathFinder-PRM-7B | ~400K | 77.9 | 75.3 | 65.0 | 59.7 | 69.5 |
+| Qwen2.5-Math-PRM-7B | ~1.5M | 82.4 | 77.6 | 67.5 | 66.3 | 73.5 |
+
+On ProcessBench, PathFinder-PRM-7B outperforms models trained on comparable amounts of data, and trails Qwen2.5-Math-PRM-7B, which was trained on roughly 3x more data, by 4 F1 points.
+
+### Reward-Guided Greedy Search (PRM@8)
+
+| Model | AIME24 | AMC23 | MATH | Olympiad | College | Minerva | Avg |
+|------------------------------|--------|-------|-------|----------|---------|---------|-------|
+| Math-Shepherd-7B | 13.3 | 52.5 | 74.6 | 38.5 | 36.5 | 41.2 | 42.8 |
+| Math-PSA-7B | 6.7 | 57.5 | 79.8 | 42.5 | 41.0 | 39.3 | 44.5 |
+| Skywork-PRM-7B | 10.0 | 57.5 | 77.8 | 41.5 | 39.0 | **43.4** | 44.9 |
+| Qwen2.5-Math-PRM-7B | 16.7 | 60.0 | **81.0** | **43.5** | 39.0 | 40.4 | 46.8 |
+| 🟢 PathFinder-PRM-7B | **20.0** | **62.5** | 78.8 | 36.5 | **55.0** | 36.7 | **48.3** |
+
+Note: All results are computed using reward-guided greedy search with Qwen2.5-7B-Instruct as the policy model. PathFinder-PRM-7B outperforms all open-source discriminative PRMs in reward-guided greedy search, showing that it better guides policy models towards correct solutions.
 
 
 ## Citation
@@ -198,4 +245,4 @@ reward_score = run_inference(messages)
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2505.12345},
 }
-```
+```
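
The note on the PRM@8 results above describes the evaluation setup: the policy model proposes several candidate next steps and the PRM keeps the highest-scoring one at each round. Below is a minimal, hypothetical sketch of that loop; the helper names (`generate_candidate_steps` standing in for the policy model, `score_step` standing in for the PRM reward call such as the README's `run_inference`) and the stopping heuristic are illustrative assumptions, not code from this repository.

```python
from typing import Callable, List

def reward_guided_greedy_search(
    problem: str,
    generate_candidate_steps: Callable[[str, List[str], int], List[str]],
    score_step: Callable[[str, List[str], str], float],
    num_candidates: int = 8,   # PRM@8: eight candidates per step
    max_steps: int = 20,
) -> List[str]:
    """Greedily build a solution, keeping the highest-reward candidate step each round."""
    solution_steps: List[str] = []
    for _ in range(max_steps):
        # Ask the policy model (e.g. Qwen2.5-7B-Instruct) for candidate next steps.
        candidates = generate_candidate_steps(problem, solution_steps, num_candidates)
        if not candidates:
            break
        # Score each candidate with the PRM and keep the best-scoring step.
        best_step = max(candidates, key=lambda step: score_step(problem, solution_steps, step))
        solution_steps.append(best_step)
        # Heuristic stop once the policy emits a final boxed answer.
        if "\\boxed" in best_step:
            break
    return solution_steps
```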