---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
This model, VLAA-Thinker-Qwen2VL-7B, is a vision-language model fine-tuned on the VLAA-Thinking dataset. As described in [the paper](https://huggingface.co/papers/2504.11468), it combines supervised fine-tuning (SFT) and reinforcement learning (RL) to improve reasoning capabilities in large vision-language models. The model excels at multimodal reasoning tasks, achieving state-of-the-art performance on the OpenCompass Multimodal Reasoning Leaderboard as of April 7th, 2025.
<p align="center"> | |
🌐 <a href="https://ucsc-vlaa.github.io/VLAA-Thinking/" target="_blank">Project Page</a> | |
• <img src="./assets/ar.svg" alt="Arxiv Logo" style="height: 1em; vertical-align: middle; margin-right: 0.3em;"> | |
<a href="./assets/VLAA-Thinker.pdf" target="_blank">Arxiv</a> | |
• 💻 <a href="https://github.com/UCSC-VLAA/VLAA-Thinking" target="_blank">Code</a> | |
</p> | |
Both **VLAA-Thinker-Qwen2.5-3B** and **VLAA-Thinker-Qwen2.5-7B** achieve **SOTA** performance on the [OpenCompass Multimodal Reasoning Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal-reasoning/?m=REALTIME) as of April 7th, 2025.
<img src="assets/opencompass_4b_box.png" width = "640" alt="pipeline" align=center /> | |
----- | |
<img src="assets/opencompass_7b_box.png" width = "640" alt="pipeline" align=center /> | |
## Quick Start 🚀

### Inference
Run `python inference.py`. Note that the model was trained with a system prompt; make sure the same prompt is included at inference time.
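If you want to call the model directly with 🤗 Transformers instead of `inference.py`, the snippet below is a minimal sketch of the flow, assuming the standard Qwen2-VL processor/generation API and that `UCSC-VLAA/VLAA-Thinker-Qwen2VL-7B` is the model's Hub repo id. The system prompt string and image path are placeholders; copy the actual system prompt from `inference.py`.

```python
# Minimal inference sketch (assumes the standard Qwen2-VL API in transformers).
# The system prompt below is a placeholder -- use the exact prompt from inference.py.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "UCSC-VLAA/VLAA-Thinker-Qwen2VL-7B"  # assumed Hub repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

system_prompt = "<copy the system prompt from inference.py>"  # placeholder
image = Image.open("example.jpg")  # placeholder image path

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem shown in the image and explain your reasoning."},
        ],
    },
]

# Build the chat-formatted prompt, then encode text and image together.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=1024)
generated = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```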
### Dataset Download
Run `bash ./utils/download_dataset.sh`, specifying the dataset root as an absolute path. The dataset should be organized as follows:
```
├── VLAA-Thinking-SFT-126K.json
├── VLAA-Thinking-GRPO-25K.json
└── images
    ├── allava_laion
    ├── arxivqa
    ├── chartqa
    ├── clevr_math
    ├── coco
    │   └── train2017
    ├── docvqa
    ├── geoqa170k
    ├── synthesis
    ├── vg
    │   ├── VG_100K
    │   └── VG_100K_2
    └── vizwiz
```
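After downloading, you can sanity-check that the layout matches the tree above. The sketch below assumes the two annotation files are plain JSON lists and uses a hypothetical `/abs/path/to/VLAA-Thinking` dataset root; it makes no assumptions about the fields inside each record.

```python
# Rough sanity check of the downloaded dataset layout.
# Assumes the JSON files are standard JSON lists; the dataset root path is hypothetical.
import json
from pathlib import Path

data_root = Path("/abs/path/to/VLAA-Thinking")  # replace with your absolute dataset root

# Count records in the two annotation files.
sft = json.loads((data_root / "VLAA-Thinking-SFT-126K.json").read_text())
grpo = json.loads((data_root / "VLAA-Thinking-GRPO-25K.json").read_text())
print(f"SFT samples: {len(sft)}, GRPO samples: {len(grpo)}")

# Confirm the expected image sub-folders from the tree above exist.
expected = ["allava_laion", "arxivqa", "chartqa", "clevr_math", "coco",
            "docvqa", "geoqa170k", "synthesis", "vg", "vizwiz"]
missing = [d for d in expected if not (data_root / "images" / d).is_dir()]
print("Missing image folders:", missing or "none")
```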
### Training

Code coming soon!