This model is a GRPO-fine-tuned version of allenai/OLMo-2-1124-7B-SFT, trained on the MATH dataset.
This model is associated with the paper Learning to Reason without External Rewards, which introduces Intuitor, a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty—the model’s own internal confidence—as the sole reward. This approach is built on a novel paradigm called Reinforcement Learning from Internal Feedback (RLIF), enabling models to learn without external rewards, gold labels, or verifiers by optimizing intrinsic signals.
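To make the training signal concrete, here is a minimal sketch (an illustration under my own assumptions, not the authors' released code) of one way a self-certainty-style reward could be computed, here as the mean KL divergence from a uniform distribution to the model's next-token distributions over the generated tokens, and of how GRPO would normalize such rewards within a group of responses sampled for the same prompt:

import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) logits of the generated tokens.
    # Returns the mean KL(Uniform || p) over the generated positions.
    log_probs = F.log_softmax(logits, dim=-1)
    vocab_size = logits.size(-1)
    # KL(U || p) = -log|V| - mean_j log p_j, averaged over positions
    kl_per_position = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_position.mean()

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style advantages: rewards standardized within the group of
    # responses sampled for the same prompt (rewards: shape (group_size,)).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

Because the reward comes entirely from the policy's own output distributions, no verifier or gold answer is needed during training; the exact reward definition and hyperparameters are described in the paper.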
You can load and use this model with the transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH"
# It's recommended to load with bfloat16 for OLMo-2 models if supported by your hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
# Example usage:
prompt = "Question: What is 2 + 2?
Answer:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
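The example above uses a plain question/answer prompt. If the tokenizer ships a chat template (the OLMo-2 SFT checkpoints are instruction-tuned, so it typically does; check tokenizer.chat_template), you can build prompts with apply_chat_template instead. A minimal sketch, reusing the model and tokenizer loaded above:

messages = [{"role": "user", "content": "What is 2 + 2?"}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
chat_output = model.generate(chat_inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(chat_output[0], skip_special_tokens=True))

If you use this model or the Intuitor method, please cite: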
@article{zhao2025learning,
title={Learning to Reason without External Rewards},
author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
journal={arXiv preprint arXiv:2505.19590},
year={2025}
}