Qwen2.5-3B-GRPO-MATH-1EPOCH

This model is a GRPO-fine-tuned version of Qwen2.5-3B, trained on the MATH dataset, as presented in the paper Learning to Reason without External Rewards.

Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. It is built on a novel paradigm called Reinforcement Learning from Internal Feedback (RLIF). This model represents an instance fine-tuned using the GRPO policy optimization algorithm within this framework.

RLIF enables LLMs to learn from intrinsic signals without external rewards or labeled data, offering a scalable alternative for autonomous AI systems where verifiable rewards are unavailable. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation.
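
For intuition on the reward itself: the paper defines self-certainty as the average KL divergence between a uniform distribution over the vocabulary and the model's next-token distribution at each generated position, so peaked (confident) predictions score higher than flat (uncertain) ones. The snippet below is a minimal sketch of that quantity under that definition; the function name and toy sanity check are ours, not the paper's reference implementation.

import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) next-token logits for one generated response.
    # Per position, KL(U || p) = -log|V| - mean_j(log p_j); average over positions.
    log_probs = F.log_softmax(logits, dim=-1)
    vocab_size = logits.size(-1)
    kl_per_position = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_position.mean()

# Sanity check: confident (peaked) logits score higher than uncertain (flat) ones.
peaked = torch.zeros(4, 1000)
peaked[:, 0] = 10.0
flat = torch.zeros(4, 1000)
assert self_certainty(peaked) > self_certainty(flat)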

Key Features

  • Reinforcement Learning from Internal Feedback (RLIF): A framework enabling LLMs to learn from intrinsic signals without external rewards, gold labels, or verifiers.
  • Self-Certainty as Reward: Intuitor uses the model's own confidence (self-certainty) as its sole reward signal, plugged into GRPO's group-relative update (see the sketch after this list).
  • Mathematical Reasoning: Specifically fine-tuned on the MATH dataset to enhance mathematical reasoning capabilities.
  • Code Generation: Demonstrates strong generalization to code generation tasks.
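
Because Intuitor keeps GRPO's optimization loop and only swaps the reward, the self-certainty scores are normalized within a group of responses sampled for the same prompt, following GRPO's group normalization (Shao et al., 2024). A minimal sketch of that group-relative advantage, with hypothetical self-certainty scores standing in for verifier rewards:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (G,) one scalar per response sampled for the same prompt.
    # GRPO standardizes within the group instead of learning a value model.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. self-certainty scores for G = 4 sampled responses to one prompt
scores = torch.tensor([2.1, 3.4, 1.8, 2.9])
print(group_relative_advantages(scores))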

Usage

This model is compatible with the Hugging Face transformers library. You can load and use it for text generation as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name = "sunblaze-ucb/Qwen2.5-3B-GRPO-MATH-1EPOCH"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Define a conversation prompt for mathematical reasoning
prompt = "Question: What is the sum of the first 100 positive integers?
Answer:"

# Apply the chat template suitable for Qwen models
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Tokenize the input (returns input_ids and an attention mask)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Set generation configuration
generation_config = GenerationConfig(
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Generate a response and decode only the newly generated tokens
outputs = model.generate(**model_inputs, generation_config=generation_config)
response = tokenizer.decode(outputs[0][model_inputs.input_ids.shape[-1]:], skip_special_tokens=True)

print(response)
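
For the example prompt, the decoded response should contain the model's reasoning and the answer 100 · 101 / 2 = 5050; the exact wording will vary from run to run because sampling is enabled.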

Code

The official implementation and training scripts are available in the paper's GitHub repository.

Citation

If you use this model or the associated research, please cite the paper:

@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}

@article{shao2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and … Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}

Model Details

  • Base model: Qwen/Qwen2.5-3B
  • Model size: 3.4B parameters
  • Tensor type: BF16 (Safetensors)