---
license: mit
---

## TMLR-Group-HF/GT-Qwen3-8B-Base

This is the Qwen3-8B-Base model trained with the GRPO Ground Truth method on the MATH training set.
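
The checkpoint can presumably be loaded like any other causal language model on the Hub. The following is a minimal sketch, assuming a recent `transformers` release with Qwen3 support plus `torch` and `accelerate` (needed for `device_map="auto"`). The `build_prompt` helper and its prompt format are hypothetical illustrations, not the format used during training.

```python
MODEL_ID = "TMLR-Group-HF/GT-Qwen3-8B-Base"  # repository name taken from this model card


def build_prompt(problem: str) -> str:
    """Wrap a math problem in a plain-text prompt (hypothetical format)."""
    return f"Question: {problem}\nPlease reason step by step.\nAnswer:"


def generate(problem: str, max_new_tokens: int = 512) -> str:
    """Load the model and greedily decode a completion for one problem."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumption: hardware with bfloat16 support
        device_map="auto",           # requires the `accelerate` package
    )

    inputs = tokenizer(build_prompt(problem), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("What is 12 * 13?")` downloads the weights on first use. Since this is a base model rather than a chat model, a plain-text prompt is used here instead of a chat template.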

If you are interested in Co-Reward, you can find more details in our GitHub repository: [https://github.com/tmlr-group/Co-Reward](https://github.com/tmlr-group/Co-Reward).

## Citation

```
@article{zhang2025coreward,
  title={Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement},
  author={Zizhuo Zhang and Jianing Zhu and Xinmu Ge and Zihua Zhao and Zhanke Zhou and Xuan Li and Xiao Feng and Jiangchao Yao and Bo Han},
  journal={arXiv preprint arXiv:2508.00410},
  year={2025},
}
```