Update README.md
README.md CHANGED
@@ -6,6 +6,8 @@ base_model:
 - Qwen/Qwen2.5-Math-7B-Instruct
 pipeline_tag: text-classification
 library_name: transformers
+datasets:
+- declare-lab/PathFinder-600K
 ---
 
 # PathFinder-PRM-7B
@@ -181,9 +183,54 @@ reward_score = run_inference(messages)
 
 ### Results
 
-[
 
-
+
+
+#### PRMBench Results
+
+| Model | Simplicity | Soundness | Sensitivity | Overall |
+|----------------------------------|------------|-----------|-------------|---------|
+| **LLM-as-judge, Proprietary Language Models** | | | | |
+| Gemini-2.0-thinking-exp-1219 | 66.2 | 71.8 | 75.3 | 68.8 |
+| GPT-4o | 59.7 | 70.9 | 75.8 | 66.8 |
+| **LLM-as-judge, Open-source Language Models** | | | | |
+| Qwen-2.5-Math-72B | 55.1 | 61.1 | 67.1 | 57.4 |
+| QwQ-Preview-32B | 56.4 | 68.2 | 73.5 | 63.6 |
+| **Discriminative Process Reward Models** | | | | |
+| Math-Shepherd-7B | 47.1 | 45.7 | 60.7 | 47.0 |
+| Math-PSA-7B | 51.3 | 51.8 | 64.9 | 52.3 |
+| RLHFlow-Mistral-8B | 46.7 | 57.5 | 68.5 | 54.4 |
+| Llemma-PRM800K-7B | 51.4 | 50.9 | 66.0 | 52.0 |
+| ReasonEval-7B | 55.5 | 63.9 | 71.0 | 60.0 |
+| Qwen2.5-Math-PRM-7B | 52.1 | **71.0** | 75.5 | 65.5 |
+| 🟢 PathFinder-PRM-7B | **58.9** | 70.8 | **76.9** | **67.7** |
+
+Note: Simplicity, Soundness, and Sensitivity are averaged sub-metrics from PRMBench. PathFinder-PRM-7B outperforms all open-source discriminative PRMs and LLM-as-judge models, while remaining competitive with large proprietary models.
+
+
+#### ProcessBench Results
+
+| Model | # Samples | GSM8K | MATH | Olympiad | OmniMath | Avg. F1 |
+|-------------------------------|-----------|-------|-------|----------|----------|---------|
+| Math-Shepherd-7B | 445K | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
+| RLHFlow-Mistral-8B | 273K | 50.4 | 33.4 | 13.8 | 15.8 | 28.4 |
+| Llemma-PRM800K-7B | ~350K | 48.4 | 43.1 | 28.5 | 33.4 | 38.4 |
+| Qwen2.5-Math-7B-PRM800K | 264K | 68.2 | 62.6 | 50.7 | 44.3 | 58.5 |
+| 🟢 PathFinder-PRM-7B | ~400K | 77.9 | 75.3 | 65.0 | 59.7 | 69.5 |
+| Qwen2.5-Math-PRM-7B | ~1.5M | 82.4 | 77.6 | 67.5 | 66.3 | 73.5 |
+
+On ProcessBench, PathFinder-PRM-7B outperforms models trained on comparable amounts of data, and trails Qwen2.5-Math-PRM-7B, which was trained on roughly 3x more data, by 4 F1 points.
+
+### Reward-Guided Greedy Search (PRM@8)
+
+| Model | AIME24 | AMC23 | MATH | Olympiad | College | Minerva | Avg |
+|------------------------------|--------|-------|-------|----------|---------|---------|-------|
+| Math-Shepherd-7B | 13.3 | 52.5 | 74.6 | 38.5 | 36.5 | 41.2 | 42.8 |
+| Math-PSA-7B | 6.7 | 57.5 | 79.8 | 42.5 | 41.0 | 39.3 | 44.5 |
+| Skywork-PRM-7B | 10.0 | 57.5 | 77.8 | 41.5 | 39.0 | **43.4** | 44.9 |
+| Qwen2.5-Math-PRM-7B | 16.7 | 60.0 | **81.0** | **43.5** | 39.0 | 40.4 | 46.8 |
+| 🟢 PathFinder-PRM-7B | **20.0** | **62.5** | 78.8 | 36.5 | **55.0** | 36.7 | **48.3** |
+
+Note: All results are computed using reward-guided greedy search with Qwen2.5-7B-Instruct as the policy model. PathFinder-PRM-7B outperforms all open-source discriminative PRMs in reward-guided greedy search, showing that it better guides policy models towards correct solutions.
 
 
 ## Citation
@@ -198,4 +245,4 @@ reward_score = run_inference(messages)
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2505.12345},
 }
-```
+```
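
The note on the PRM@8 results above describes the evaluation setup: the policy model proposes several candidate next steps and the PRM keeps the highest-scoring one at each round. Below is a minimal, hypothetical sketch of that loop; the helper names (`generate_candidate_steps` standing in for the policy model, `score_step` standing in for the PRM reward call such as the README's `run_inference`) and the stopping heuristic are illustrative assumptions, not code from this repository.

```python
from typing import Callable, List

def reward_guided_greedy_search(
    problem: str,
    generate_candidate_steps: Callable[[str, List[str], int], List[str]],
    score_step: Callable[[str, List[str], str], float],
    num_candidates: int = 8,   # PRM@8: eight candidates per step
    max_steps: int = 20,
) -> List[str]:
    """Greedily build a solution, keeping the highest-reward candidate step each round."""
    solution_steps: List[str] = []
    for _ in range(max_steps):
        # Ask the policy model (e.g. Qwen2.5-7B-Instruct) for candidate next steps.
        candidates = generate_candidate_steps(problem, solution_steps, num_candidates)
        if not candidates:
            break
        # Score each candidate with the PRM and keep the best-scoring step.
        best_step = max(candidates, key=lambda step: score_step(problem, solution_steps, step))
        solution_steps.append(best_step)
        # Heuristic stop once the policy emits a final boxed answer.
        if "\\boxed" in best_step:
            break
    return solution_steps
```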