Qwen2.5-3B-GRPO-MATH-1EPOCH
This model is a GRPO-fine-tuned version of Qwen2.5-3B, trained on the MATH dataset, as presented in the paper Learning to Reason without External Rewards.
Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm called Reinforcement Learning from Internal Feedback (RLIF). This model represents an instance fine-tuned using the GRPO policy optimization algorithm within this framework.
RLIF enables LLMs to learn from intrinsic signals without external rewards or labeled data, offering a scalable alternative for autonomous AI systems where verifiable rewards are unavailable. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation.
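To make the reward signal concrete, the sketch below shows one way to compute a self-certainty-style score from next-token logits, following the paper's description of self-certainty as the average KL divergence between a uniform distribution over the vocabulary and the model's predicted next-token distribution. The function name and tensor shapes here are illustrative and not taken from the released training code.

import torch
import torch.nn.functional as F

# Illustrative helper (not from the official codebase):
# self-certainty of a generated sequence = mean over positions of KL(U || p),
# where U is uniform over the vocabulary and p is the next-token distribution.
def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) next-token logits for the generated tokens
    log_probs = F.log_softmax(logits.float(), dim=-1)
    vocab_size = log_probs.size(-1)
    # KL(U || p) at each position simplifies to -log(V) - mean_j log p_j
    kl_per_token = -torch.log(torch.tensor(float(vocab_size))) - log_probs.mean(dim=-1)
    return kl_per_token.mean()

A higher score means the next-token distributions are far from uniform, i.e. the model is confident in its own output; Intuitor uses this kind of intrinsic signal in place of an external verifier or gold label.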
Key Features
- Reinforcement Learning from Internal Feedback (RLIF): A framework enabling LLMs to learn from intrinsic signals without external rewards, gold labels, or verifiers.
- Self-Certainty as Reward: Intuitor uses the model's own confidence (self-certainty) as its sole reward signal.
- Mathematical Reasoning: Specifically fine-tuned on the MATH dataset to enhance mathematical reasoning capabilities.
- Code Generation: Demonstrates strong generalization to code generation tasks.
Usage
This model is compatible with the Hugging Face transformers library. You can load and use it for text generation as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch
model_name = "sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Define a conversation prompt for mathematical reasoning
prompt = "Question: What is the sum of the first 100 positive integers?\nAnswer:"
# Apply the chat template suitable for Qwen models
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Encode the input
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
# Set generation configuration
generation_config = GenerationConfig(
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
max_new_tokens=2048,
do_sample=True,
temperature=0.7,
top_p=0.9,
)
# Generate response
outputs = model.generate(input_ids, generation_config=generation_config)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
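Self-certainty can also be used at inference time to rank sampled candidates (a best-of-N style selection). The sketch below is a rough illustration that continues from the snippet above (it reuses model, tokenizer, and text) and is not part of the released code; for simplicity it rescores each candidate with a full forward pass and does not mask out padding after early end-of-sequence tokens.

import torch
import torch.nn.functional as F

# Sample several candidate answers for the same chat-formatted prompt.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]
candidates = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=512,
    num_return_sequences=4,
)

best_score, best_answer = float("-inf"), None
for seq in candidates:
    # Recompute logits for the whole sequence in one forward pass.
    with torch.no_grad():
        logits = model(seq.unsqueeze(0)).logits[0]
    # Logits at position i predict token i + 1, so keep the positions
    # that produced the completion tokens.
    step_logits = logits[prompt_len - 1 : seq.shape[0] - 1].float()
    log_probs = F.log_softmax(step_logits, dim=-1)
    vocab_size = log_probs.size(-1)
    # Self-certainty: mean KL(U || p) over the generated positions.
    score = (-torch.log(torch.tensor(float(vocab_size))) - log_probs.mean(dim=-1)).mean().item()
    if score > best_score:
        best_score = score
        best_answer = tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)

print(f"Best self-certainty: {best_score:.3f}")
print(best_answer)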
Code
The official implementation and training scripts are available in the project's GitHub repository.
Citation
If you use this model or the associated research, please cite the following papers:
@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}

@article{shao2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and … Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
Base model
- Qwen/Qwen2.5-3B