---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: transformers
license: other
tags:
- llama-factory
- full
- generated_from_trainer
- vision-language-model
model-index:
- name: Qwen2.5-VL-3B-Instruct
results: []
pipeline_tag: image-text-to-text
---
# Qwen2.5-VL-3B-Instruct: Self-Rewarding Vision-Language Model via Reasoning Decomposition
This model is a fine-tuned version of [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the mllm_data1_cotOnly and the mllm_data1_description_val_text_only datasets. It was presented in the paper [Self-Rewarding Vision-Language Model via Reasoning Decomposition](https://huggingface.co/papers/2508.19652).
Code: https://github.com/zli12321/Vision-SR1
## Abstract
Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision via reinforcement learning. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input to compute a reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
## Model description
Vision-SR1 is a self-rewarding Reinforcement Learning (RL) training framework that decomposes a Vision-Language Model's (VLM's) reasoning into visual perception reasoning and language reasoning. Inspired by works such as Vision-R1, Visionary-R1, and R1-VL, Vision-SR1 leverages the VLM's self-evolving reasoning ability to reward itself.
VLMs often rely primarily on language reasoning rather than visual perception because the vision encoder is fused with the LLM backbone late in pretraining. Standard RL training can therefore reward the recall of prior language knowledge for accuracy gains while neglecting visual evidence. External LLM-based perception rewards can help but introduce bias and heavy latency. Vision-SR1 instead uses a self-reward framework in which the model provides its own visual and reasoning feedback without added latency, strengthening both visual perception and language reasoning, mitigating visual hallucinations, and reducing reliance on language shortcuts.
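The sketch below illustrates the two-stage decomposition as a reward function: the VLM first produces a self-contained visual perception of the image, the same VLM is then re-prompted text-only on that perception to check whether it still answers correctly, and this self-reward is combined with standard final-answer supervision. It is a minimal illustration under assumed interfaces (a generic `vlm_generate` callable, exact-match answer checking, and an assumed weighting `alpha`), not the paper's actual implementation.
```python
from typing import Callable

def vision_sr1_reward(
    vlm_generate: Callable[..., str],   # assumed interface: vlm_generate(prompt, image=None) -> text
    image,
    question: str,
    gold_answer: str,
    alpha: float = 0.5,                 # assumed weighting between perception and final-answer rewards
) -> float:
    """Illustrative two-stage self-reward; not the paper's exact prompts or scoring."""
    # Stage 1: visual perception -- ask the VLM for a self-contained description
    # of the visual details needed to answer the question.
    perception = vlm_generate(
        f"Describe all visual details needed to answer: {question}", image=image
    )

    # Stage 2: language reasoning -- re-prompt the *same* VLM with only the
    # generated perception (no image) and check whether it still answers correctly.
    text_only_answer = vlm_generate(
        f"Context: {perception}\nQuestion: {question}\nAnswer:", image=None
    )
    perception_reward = float(text_only_answer.strip() == gold_answer.strip())

    # Final-output supervision: the usual verifiable answer reward computed on the
    # full image + question rollout.
    full_answer = vlm_generate(f"Question: {question}\nAnswer:", image=image)
    outcome_reward = float(full_answer.strip() == gold_answer.strip())

    # Combine the self-reward with final-output supervision.
    return alpha * perception_reward + (1.0 - alpha) * outcome_reward
```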
## Intended uses & limitations
This model is intended for research in Vision-Language Models, particularly for tasks benefiting from improved visual reasoning, mitigation of visual hallucinations, and reduced reliance on language shortcuts.
**Limitations:**
* LLM evaluation scripts and model generation outputs with LLM judgments are currently in progress.
## Training and evaluation data
The training data for Vision-SR1 is drawn from 23 sources and evenly split across three areas: general visual understanding, science knowledge, and multimodal mathematical reasoning.
The datasets constructed for Vision-SR1 training are listed below, followed by a loading sketch:
* [📊 Vision-SR1-Cold-Start-9K](https://huggingface.co/datasets/LMMs-Lab-Turtle/Vision-SR1-Cold-9K) (for Supervised Fine-Tuning, SFT)
* [📊 Vision-SR1-47K](https://huggingface.co/datasets/LMMs-Lab-Turtle/Vision-SR1-47K) (for Reinforcement Learning, RL)
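Both can be loaded with 🤗 Datasets. The snippet below is a minimal loading sketch; the `train` split name is an assumption, so check the dataset cards for the exact splits and fields.
```python
from datasets import load_dataset

# Split names are assumptions -- verify against the dataset cards.
sft_data = load_dataset("LMMs-Lab-Turtle/Vision-SR1-Cold-9K", split="train")  # SFT cold start
rl_data = load_dataset("LMMs-Lab-Turtle/Vision-SR1-47K", split="train")       # RL training

print(sft_data)
print(rl_data[0].keys())  # inspect the available fields
```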
## Sample Usage
The following snippets are adapted directly from the [Vision-SR1 GitHub repository](https://github.com/zli12321/Vision-SR1) to demonstrate setup and training procedures.
### Requirements
```bash
git clone https://github.com/zli12321/Vision-SR1.git
cd Vision-SR1
conda create -n Vision-SR1 python=3.11
conda activate Vision-SR1  # activate the environment before running setup
bash setup.sh
```
### GRPO Training
```bash
### Self-Reward Vision-SR1 GRPO Training
bash ./train_examples/2-7b_selfReward_train.sh
### Vision-SR1 regular training
bash ./train_examples/1-7b_visionR1_train.sh
```
### Merge checkpoints
```bash
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```
### Generating Evaluation Responses
```bash
bash ./validation_examples/2-seethink_format_eval.sh
```
### Supervised Finetuning Setup
The supervised finetuning code is adapted from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for easy setup.
```bash
conda create -n SFT python=3.11
conda activate SFT  # activate the environment before installing dependencies
cd LLaMA-Factory-Cold-Start
pip install -e ".[torch,metrics]" --no-build-isolation
pip install --upgrade huggingface_hub
huggingface-cli login
```
### Supervised Finetuning Training
```bash
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/Vision-SR1-Cold-Start.yaml
```
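### Inference
The snippet below is not taken from the Vision-SR1 repository; it is a minimal inference sketch following the standard Qwen2.5-VL usage in 🤗 Transformers (4.49+). The model ID, image path, and prompt are placeholders to adapt to this checkpoint.
```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # substitute the fine-tuned checkpoint's repo id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg")  # placeholder image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "First describe the relevant visual details, then answer: what is shown in the image?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the generated continuation.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```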
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0
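For reference, the sketch below reconstructs these hyperparameters as 🤗 Transformers `TrainingArguments`. The actual run was configured through a LLaMA-Factory YAML (`Vision-SR1-Cold-Start.yaml`), so the output directory and any settings not listed above are assumptions.
```python
from transformers import TrainingArguments

# Illustrative reconstruction of the reported hyperparameters; not the YAML used for training.
args = TrainingArguments(
    output_dir="outputs/vision-sr1-cold-start",  # hypothetical path
    learning_rate=1e-5,
    per_device_train_batch_size=8,   # 8 devices x batch 8 x accumulation 2 = total train batch 128
    per_device_eval_batch_size=8,    # 8 devices x batch 8 = total eval batch 64
    gradient_accumulation_steps=2,
    num_train_epochs=3.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```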
### Training results
The training concluded with the following overall results:
- epoch: 2.957983193277311
- total_flos: 92447203917824.0
- train_loss: 0.6085002004763501
- train_runtime: 1135.371
- train_samples_per_second: 20.124
- train_steps_per_second: 0.156
Reward progression during training: *(plot not shown)*
### Framework versions
- Transformers 4.49.0
- Pytorch 2.7.1+cu126
- Datasets 3.6.0
- Tokenizers 0.21.1
## Citation
If you use this model or find our work helpful, please cite the original paper: [Self-Rewarding Vision-Language Model via Reasoning Decomposition](https://huggingface.co/papers/2508.19652).
We also recommend citing [EasyR1](https://github.com/hiyouga/EasyR1), the framework the training code builds on:
```bibtex
@misc{zheng2025easyr1,
  title        = {EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework},
  author       = {Yaowei Zheng and Junting Lu and Shenzhi Wang and Zhangchi Feng and Dongdong Kuang and Yuwen Xiong},
  howpublished = {\url{https://github.com/hiyouga/EasyR1}},
  year         = {2025}
}
```