Improve model card: Add pipeline tag, paper link, abstract, code, and usage
#1, opened by nielsr (HF Staff)

README.md CHANGED
The original README.md (left column of the diff), with lines removed by this PR marked `-`:

```diff
@@ -1,34 +1,110 @@
 ---
 library_name: transformers
 license: other
-base_model: Qwen/Qwen2.5-VL-3B-Instruct
 tags:
 - llama-factory
 - full
 - generated_from_trainer
 model-index:
 - name: Qwen2.5-VL-3B-Instruct
   results: []
 ---
 
-
-
 
-
 
-
 
 ## Model description
 
-
 
 ## Intended uses & limitations
 
-
 
 ## Training and evaluation data
 
-
 
 ## Training procedure
 
@@ -51,7 +127,16 @@ The following hyperparameters were used during training:
 
 ### Training results
 
 
 
 ### Framework versions
 
@@ -59,3 +144,18 @@ The following hyperparameters were used during training:
 - Pytorch 2.7.1+cu126
 - Datasets 3.6.0
 - Tokenizers 0.21.1
```

The updated README.md (right column of the diff):
---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: transformers
license: other
tags:
- llama-factory
- full
- generated_from_trainer
- vision-language-model
model-index:
- name: Qwen2.5-VL-3B-Instruct
  results: []
pipeline_tag: image-text-to-text
---

# Qwen2.5-VL-3B-Instruct: Self-Rewarding Vision-Language Model via Reasoning Decomposition

This model is a fine-tuned version of [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the mllm_data1_cotOnly and mllm_data1_description_val_text_only datasets. It was presented in the paper [Self-Rewarding Vision-Language Model via Reasoning Decomposition](https://huggingface.co/papers/2508.19652).

Code: https://github.com/zli12321/Vision-SR1

## Abstract

Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning via reinforcement learning, without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input to compute the reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.

## Model description

Vision-SR1 is a self-rewarded reinforcement learning (RL) training framework that decomposes a Vision-Language Model's (VLM's) reasoning into visual perception reasoning and language reasoning. Inspired by works such as Vision-R1, Visionary-R1, and R1-VL, Vision-SR1 leverages the VLM's own self-evolving reasoning ability to reward itself.

VLMs often rely primarily on language reasoning rather than visual perception because the vision encoder is fused with the LLM backbone late in pretraining. Standard RL training can therefore push the model to recall prior language knowledge for accuracy gains while neglecting vision. External LLM-based perception rewards can help, but they introduce bias and heavy latency. Vision-SR1 instead uses a self-reward framework in which the model provides its own visual and reasoning feedback with no added latency, strengthening both visual perception and language reasoning, mitigating visual hallucinations, and reducing reliance on language shortcuts.

<p align="center">
  <img src="https://github.com/zli12321/Vision-SR1/raw/main/assets/method.png" width="80%">
</p>
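
To make the two-stage decomposition above concrete, here is a minimal, illustrative sketch of how a self-reward of this kind could be computed for one sample. It is not the repository's implementation: the prompt wording, the `<answer>` tag, the exact-match check, and the reward weighting are all assumptions.

```python
# Illustrative sketch of a two-stage self-reward (NOT the repository's actual code).
# Stage 1: the policy VLM writes a self-contained visual perception plus an answer.
# Stage 2: the *same* VLM is re-prompted with only that perception (no image); if the
# text-only pass still reaches the correct answer, the perception earns a reward.
from typing import Callable, Optional


def self_reward(
    vlm: Callable[[str, Optional[object]], str],  # stand-in for a VLM call: (prompt, image or None) -> text
    image: object,
    question: str,
    gold_answer: str,
    w_final: float = 0.5,                          # assumed weighting between the two signals
) -> float:
    # Stage 1: perception + reasoning with the image in context.
    stage1 = vlm(f"Describe every visual detail needed to answer, then answer: {question}", image)
    perception = stage1.split("<answer>")[0]       # assumed output format
    answer = stage1.split("<answer>")[-1]

    # Final-output supervision: simple verifiable answer matching.
    r_final = float(gold_answer.lower() in answer.lower())

    # Stage 2: re-prompt the same model with only the generated perception, no image.
    stage2 = vlm(f"Context: {perception}\nQuestion: {question}\nAnswer:", None)
    r_perception = float(gold_answer.lower() in stage2.lower())

    # Balanced training signal combining final-answer reward and perception self-reward.
    return w_final * r_final + (1.0 - w_final) * r_perception
```

In the actual GRPO training, a scalar reward of this kind would be computed per rollout and fed into the policy update driven by the training scripts listed under Sample Usage below.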

## Intended uses & limitations

This model is intended for research on Vision-Language Models, particularly for tasks that benefit from improved visual reasoning, mitigation of visual hallucinations, and reduced reliance on language shortcuts.

**Limitations:**

* LLM evaluation scripts and model generation outputs with LLM judgments are still in progress.

## Training and evaluation data

The training data for Vision-SR1 is drawn from 23 sources and evenly split across three main areas: general visual understanding, science knowledge, and multimodal mathematical reasoning.

<p align="center">
  <img src="https://github.com/zli12321/Vision-SR1/raw/main/assets/data.png" width="80%">
</p>

Specific datasets constructed for Vision-SR1 training include:
* [📊 Vision-SR1-Cold-Start-9K](https://huggingface.co/datasets/LMMs-Lab-Turtle/Vision-SR1-Cold-9K) (for Supervised Fine-Tuning, SFT)
* [📊 Vision-SR1-47K](https://huggingface.co/datasets/LMMs-Lab-Turtle/Vision-SR1-47K) (for Reinforcement Learning, RL)
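
Both datasets are hosted on the Hugging Face Hub and can be pulled with the `datasets` library; a minimal loading sketch is below (the `train` split name is an assumption, so check the dataset cards for the actual splits and schema).

```python
# Minimal loading sketch for the Vision-SR1 datasets (split names are assumptions).
from datasets import load_dataset

sft_data = load_dataset("LMMs-Lab-Turtle/Vision-SR1-Cold-9K", split="train")  # cold-start SFT data
rl_data = load_dataset("LMMs-Lab-Turtle/Vision-SR1-47K", split="train")       # GRPO / RL data

print(sft_data)  # inspect the columns before wiring the data into the SFT or RL configs
```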

## Sample Usage

The following snippets are adapted from the [Vision-SR1 GitHub repository](https://github.com/zli12321/Vision-SR1) and demonstrate setup and training.

### Requirements

```bash
git clone https://github.com/zli12321/Vision-SR1.git
cd Vision-SR1
conda create -n Vision-SR1 python=3.11
conda activate Vision-SR1   # activate the environment before running setup
bash setup.sh
```

### GRPO Training

```bash
### Self-Reward Vision-SR1 GRPO Training
bash ./train_examples/2-7b_selfReward_train.sh

### Vision-SR1 regular training
bash ./train_examples/1-7b_visionR1_train.sh
```

### Merge checkpoints

```bash
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```

### Generating Evaluation Responses

```bash
bash ./validation_examples/2-seethink_format_eval.sh
```

### Supervised Finetuning Setup

The supervised finetuning code is adapted from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for easy setup.

```bash
conda create -n SFT python=3.11
conda activate SFT          # activate the SFT environment
cd LLaMA-Factory-Cold-Start
pip install -e ".[torch,metrics]" --no-build-isolation

pip install --upgrade huggingface_hub
huggingface-cli login
```

### Supervised Finetuning Training

```bash
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/Vision-SR1-Cold-Start.yaml
```

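### Inference

The repository snippets above cover training; for running the fine-tuned checkpoint itself, a minimal sketch following the standard Qwen2.5-VL usage in `transformers` is shown below. This is not taken from the repository, and the model id and image path are placeholders to replace with this model's actual Hub id (or a local path) and your own image.

```python
# Minimal inference sketch following standard Qwen2.5-VL usage in transformers
# (not from the Vision-SR1 repository; MODEL_ID and the image path are placeholders).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "path/or/hub-id/of/this/checkpoint"  # placeholder

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/your_image.jpg"},
        {"type": "text", "text": "Describe the relevant visual details, then answer: what is shown in the image?"},
    ],
}]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
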
## Training procedure

### Training results

The training concluded with the following overall results:
- epoch: 2.957983193277311
- total_flos: 92447203917824.0
- train_loss: 0.6085002004763501
- train_runtime: 1135.371
- train_samples_per_second: 20.124
- train_steps_per_second: 0.156

Reward progression during training:


### Framework versions

- Pytorch 2.7.1+cu126
- Datasets 3.6.0
- Tokenizers 0.21.1

## Citation

If you use this model or find our work helpful, please cite the original paper: [Self-Rewarding Vision-Language Model via Reasoning Decomposition](https://huggingface.co/papers/2508.19652).

We also recommend citing [EasyR1](https://github.com/hiyouga/EasyR1), on which the RL training code is built:

```bibtex
@misc{zheng2025easyr1,
  title        = {EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework},
  author       = {Yaowei Zheng and Junting Lu and Shenzhi Wang and Zhangchi Feng and Dongdong Kuang and Yuwen Xiong},
  howpublished = {\url{https://github.com/hiyouga/EasyR1}},
  year         = {2025}
}
```