Improve model card: Add pipeline tag, paper link, abstract, code, and usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +108 -8
README.md CHANGED
@@ -1,34 +1,110 @@
  ---
  library_name: transformers
  license: other
- base_model: Qwen/Qwen2.5-VL-3B-Instruct
  tags:
  - llama-factory
  - full
  - generated_from_trainer
  model-index:
  - name: Qwen2.5-VL-3B-Instruct
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # Qwen2.5-VL-3B-Instruct

- This model is a fine-tuned version of [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the mllm_data1_cotOnly and the mllm_data1_description_val_text_only datasets.

  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

  ## Training and evaluation data

- More information needed

  ## Training procedure

@@ -51,7 +127,16 @@ The following hyperparameters were used during training:

  ### Training results

  ### Framework versions

@@ -59,3 +144,18 @@ The following hyperparameters were used during training:
  - Pytorch 2.7.1+cu126
  - Datasets 3.6.0
  - Tokenizers 0.21.1

  ---
+ base_model: Qwen/Qwen2.5-VL-3B-Instruct
  library_name: transformers
  license: other
  tags:
  - llama-factory
  - full
  - generated_from_trainer
+ - vision-language-model
  model-index:
  - name: Qwen2.5-VL-3B-Instruct
    results: []
+ pipeline_tag: image-text-to-text
  ---

+ # Qwen2.5-VL-3B-Instruct: Self-Rewarding Vision-Language Model via Reasoning Decomposition
+
+ This model is a fine-tuned version of [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) on the mllm_data1_cotOnly and the mllm_data1_description_val_text_only datasets. It was presented in the paper [Self-Rewarding Vision-Language Model via Reasoning Decomposition](https://huggingface.co/papers/2508.19652).
+
+ Code: https://github.com/zli12321/Vision-SR1

+ ## Abstract

+ Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision via reinforcement learning. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input to compute a reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.

  ## Model description

+ Vision-SR1 is a self-rewarding Reinforcement Learning (RL) training framework that decomposes a Vision-Language Model's (VLM's) reasoning into visual perception reasoning and language reasoning. Inspired by works like Vision-R1, Visionary-R1, and R1-VL, Vision-SR1 leverages the VLM's self-evolving and reasoning ability to reward itself.
+
+ VLMs often rely primarily on language reasoning rather than visual perception because they fuse the vision encoder with the LLM backbone late in pretraining. Standard RL training can lead the model to recall prior language knowledge for accuracy gains while neglecting vision. External LLM-based perception rewards can help but introduce bias and heavy latency. Vision-SR1 instead proposes a self-reward framework that lets the model provide its own visual and reasoning feedback without added latency, thereby strengthening both visual perception and language reasoning, mitigating visual hallucinations, and reducing reliance on language shortcuts. A conceptual sketch of the reward loop follows the figure below.
+
+ <p align="center">
+ <img src="https://github.com/zli12321/Vision-SR1/raw/main/assets/method.png" width="80%">
+ </p>
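+
+ To make the decomposition above concrete, here is a minimal conceptual sketch of the two-stage self-reward loop in plain Python. It is illustrative only and is not the repository's implementation: `vlm_generate` and `match` are placeholder callables, and the equal reward weighting is an assumption.
+
+ ```python
+ # Conceptual sketch of the Vision-SR1 self-reward (illustrative, not the repo's code).
+ def self_reward(vlm_generate, image, question, gold_answer, match):
+     # Stage 1: visual perception -- the model writes a self-contained description
+     # of the visual evidence needed to answer the question.
+     perception = vlm_generate(image=image, prompt=f"Describe the visual evidence needed to answer: {question}")
+
+     # Stage 2: language reasoning -- the same model is re-prompted with the
+     # perception text only (no image) and must answer from it alone.
+     text_only_answer = vlm_generate(image=None, prompt=f"{perception}\n\nQuestion: {question}\nAnswer:")
+
+     # Perception reward: the perception counts as self-contained if the
+     # text-only answer matches the gold answer.
+     r_perception = float(match(text_only_answer, gold_answer))
+
+     # Final-output reward on the ordinary image + question rollout.
+     r_final = float(match(vlm_generate(image=image, prompt=question), gold_answer))
+
+     # Combined training signal for the RL objective (weighting is illustrative).
+     return 0.5 * r_perception + 0.5 * r_final
+ ```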

  ## Intended uses & limitations

+ This model is intended for research in Vision-Language Models, particularly for tasks that benefit from improved visual reasoning, mitigation of visual hallucinations, and reduced reliance on language shortcuts.
+
+ **Limitations:**
+ * LLM evaluation scripts and model generation outputs with LLM judgments are still a work in progress in the accompanying repository.

  ## Training and evaluation data

+ The training data for Vision-SR1 is drawn from 23 sources and evenly split across three main areas: general visual understanding, science knowledge, and multimodal mathematical reasoning.
+
+ <p align="center">
+ <img src="https://github.com/zli12321/Vision-SR1/raw/main/assets/data.png" width="80%">
+ </p>
+
+ Specific datasets constructed for Vision-SR1 training include:
+ * [📊 Vision-SR1-Cold-Start-9K](https://huggingface.co/datasets/LMMs-Lab-Turtle/Vision-SR1-Cold-9K) (for Supervised Fine-Tuning, SFT)
+ * [📊 Vision-SR1-47K](https://huggingface.co/datasets/LMMs-Lab-Turtle/Vision-SR1-47K) (for Reinforcement Learning, RL)
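+
+ Both datasets can be pulled from the Hub with 🤗 Datasets, as in the minimal sketch below; the `train` split name and the column layout are assumptions, so check the dataset cards for the actual schema.
+
+ ```python
+ # Minimal sketch: load the Vision-SR1 training datasets from the Hugging Face Hub.
+ from datasets import load_dataset
+
+ sft_data = load_dataset("LMMs-Lab-Turtle/Vision-SR1-Cold-9K", split="train")  # cold-start SFT data
+ rl_data = load_dataset("LMMs-Lab-Turtle/Vision-SR1-47K", split="train")       # RL (GRPO) data
+
+ print(sft_data)            # dataset size and columns
+ print(rl_data[0].keys())   # fields of one RL example
+ ```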
+
+ ## Sample Usage
+
+ An inference sketch is given first for convenience; the setup and training snippets that follow it are adapted directly from the [Vision-SR1 GitHub repository](https://github.com/zli12321/Vision-SR1).
+
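+ ### Inference (Transformers)
+
+ This checkpoint keeps the Qwen2.5-VL-3B-Instruct architecture, so the standard Qwen2.5-VL chat pattern in 🤗 Transformers should work. The snippet below is a minimal sketch, assuming `transformers >= 4.49` and `qwen-vl-utils` are installed; the repository id, image URL, and prompt are placeholders to replace with your own.
+
+ ```python
+ # Minimal Qwen2.5-VL-style inference sketch (placeholders: repo id, image, prompt).
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ model_id = "path/or/hub-id-of-this-checkpoint"  # replace with this model's Hub id or a local path
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "image": "https://example.com/some_image.jpg"},
+         {"type": "text", "text": "First describe the relevant visual evidence, then answer: what is shown in this image?"},
+     ],
+ }]
+
+ # Build the chat prompt and pack text + images into model inputs.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
+                    padding=True, return_tensors="pt").to(model.device)
+
+ # Generate and strip the prompt tokens before decoding.
+ generated = model.generate(**inputs, max_new_tokens=512)
+ trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
+ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
+ ```
+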
+ ### Requirements
+
+ ```bash
+ git clone https://github.com/zli12321/Vision-SR1.git
+ cd Vision-SR1
+ conda create -n Vision-SR1 python=3.11
+ conda activate Vision-SR1
+ bash setup.sh
+ ```
+
+ ### GRPO Training
+
+ ```bash
+ ### Self-Reward Vision-SR1 GRPO Training
+ bash ./train_examples/2-7b_selfReward_train.sh
+
+ ### Vision-SR1 regular training
+ bash ./train_examples/1-7b_visionR1_train.sh
+ ```
+
+ ### Merge checkpoints
+
+ ```bash
+ python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
+ ```
+
+ ### Generating Evaluation Responses
+
+ ```bash
+ bash ./validation_examples/2-seethink_format_eval.sh
+ ```
+
+ ### Supervised Finetuning Setup
+
+ The supervised finetuning code is adapted from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for easy setup.
+
+ ```bash
+ conda create -n SFT python=3.11
+ conda activate SFT
+ cd LLaMA-Factory-Cold-Start
+ pip install -e ".[torch,metrics]" --no-build-isolation
+
+ pip install --upgrade huggingface_hub
+ huggingface-cli login
+ ```
+
+ ### Supervised Finetuning Training
+
+ ```bash
+ FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/Vision-SR1-Cold-Start.yaml
+ ```

  ## Training procedure

  ### Training results

+ The training run concluded with the following overall results:
+ - epoch: 2.96
+ - total_flos: 92447203917824.0
+ - train_loss: 0.6085
+ - train_runtime: 1135.371 s
+ - train_samples_per_second: 20.124
+ - train_steps_per_second: 0.156

+ Reward progression during training:
+ ![Reward Progression in training](https://github.com/zli12321/Vision-SR1/raw/main/assets/reward_progression.png)

  ### Framework versions

  - Pytorch 2.7.1+cu126
  - Datasets 3.6.0
  - Tokenizers 0.21.1
+
+ ## Citation
+
+ If you use this model or find our work helpful, please cite the original paper: [Self-Rewarding Vision-Language Model via Reasoning Decomposition](https://huggingface.co/papers/2508.19652).
+
+ We also recommend citing [EasyR1](https://github.com/hiyouga/EasyR1), the RL training framework the source code builds on:
+
+ ```bibtex
+ @misc{zheng2025easyr1,
+   title        = {EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework},
+   author       = {Yaowei Zheng and Junting Lu and Shenzhi Wang and Zhangchi Feng and Dongdong Kuang and Yuwen Xiong},
+   howpublished = {\url{https://github.com/hiyouga/EasyR1}},
+   year         = {2025}
+ }
+ ```