EurusPRM-Stage2

Links

Introduction

EurusPRM-Stage2 is trained with Implicit PRM, which yields process rewards at no additional labeling cost: it only requires training an outcome reward model (ORM) on the cheaper response-level labels. During inference, implicit process rewards are obtained with a forward pass, by computing the log-likelihood ratio at each step.
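
The inference step described above can be sketched in a few lines. This is a minimal illustration, not the model's actual scoring code: it assumes the per-token log-probabilities of the response have already been computed under the trained model and under a reference model, and that the token index where each reasoning step ends is known. The function name `implicit_step_rewards`, its parameters, and the `beta` scaling coefficient (a placeholder value here) are hypothetical names introduced for this example.

```python
from typing import List, Sequence


def implicit_step_rewards(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    step_ends: Sequence[int],
    beta: float = 1.0,  # placeholder scale, not a value taken from the model card
) -> List[float]:
    """Return one reward per step: beta * sum of token log-likelihood ratios.

    policy_logprobs[i] and ref_logprobs[i] are the log-probabilities of
    response token i under the trained model and the reference model.
    step_ends holds the exclusive end index of each step, in increasing order.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same tokens")

    rewards: List[float] = []
    start = 0
    for end in step_ends:
        if not start < end <= len(policy_logprobs):
            raise ValueError("step_ends must be increasing and within range")
        # Log-likelihood ratio of this step = sum of per-token log-prob differences.
        log_ratio = sum(
            p - r for p, r in zip(policy_logprobs[start:end], ref_logprobs[start:end])
        )
        rewards.append(beta * log_ratio)
        start = end
    return rewards


# Toy numbers: four response tokens split into two steps of two tokens each.
example = implicit_step_rewards(
    policy_logprobs=[-1.0, -0.5, -2.0, -0.25],
    ref_logprobs=[-1.5, -1.0, -1.0, -0.25],
    step_ends=[2, 4],
    beta=0.5,
)
print(example)  # [0.5, -0.5]
```

A positive value means the trained model assigns the step more probability than the reference model does; a negative value means less. In practice the two log-probability lists would come from forward passes of the two models over the same tokenized response.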

The key ingredient of Implicit PRM is its reward representation, which parameterizes the reward as a log-likelihood ratio.
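
A sketch of that representation, assuming the notation of the Implicit PRM paper rather than anything stated on this card: with trained model $\pi_\theta$, reference model $\pi_{\text{ref}}$, and scaling coefficient $\beta$, the outcome reward of a response $\mathbf{y}$ is

$$
r_\theta(\mathbf{y}) := \beta \log \frac{\pi_\theta(\mathbf{y})}{\pi_{\text{ref}}(\mathbf{y})}
$$

Because the sequence likelihood factorizes over tokens, the same ratio can be accumulated over the tokens of a single step, which gives the implicit process reward for step $t$ (here $S_t$ denotes the set of token indices in step $t$, a symbol introduced for this sketch):

$$
r_\theta^{t} := \sum_{i \in S_t} \beta \log \frac{\pi_\theta(y_i \mid \mathbf{y}_{<i})}{\pi_{\text{ref}}(y_i \mid \mathbf{y}_{<i})}
$$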

Model size: 7.62B params
Tensor type: BF16
Format: Safetensors