Improve model card introductory links and add project homepage
#1
by nielsr (HF Staff) - opened

README.md CHANGED
---
base_model:
- OpenGVLab/InternVL3_5-30B-A3B-MPO
datasets:
- OpenGVLab/MMPR-v1.2
- OpenGVLab/MMPR-Tiny
language:
- multilingual
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- internvl
- custom_code
base_model_relation: finetune
---

# InternVL3_5-30B-A3B

This repository contains the InternVL3.5-30B-A3B model.

- **Paper**: [InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://huggingface.co/papers/2508.18265)
- **Project Homepage**: [https://internvl.github.io/](https://internvl.github.io/)
- **Code**: [https://github.com/OpenGVLab/InternVL](https://github.com/OpenGVLab/InternVL)
- **Live Demo**: [https://chat.intern-ai.org.cn/](https://chat.intern-ai.org.cn/)
- **Documents**: [https://internvl.readthedocs.io/en/latest/](https://internvl.readthedocs.io/en/latest/)

Other InternVL papers: [InternVL 1.0](https://huggingface.co/papers/2312.14238), [InternVL 1.5](https://huggingface.co/papers/2404.16821), [InternVL 2.5](https://huggingface.co/papers/2412.05271), [InternVL2.5-MPO](https://huggingface.co/papers/2411.10442), [InternVL3](https://huggingface.co/papers/2504.10479)

<div align="center">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">

## Introduction

We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

![image/jpg](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3.5/overall.jpg)

`InternVL3.5-Flash`:
Compared to InternVL3.5, InternVL3.5-Flash further integrates the *Visual Resolution Router (ViR)*, yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50% while maintaining nearly 100% of the performance of InternVL3.5.

![architecture/flash.jpg](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3.5/flash_arch.jpg)

### Pre-Training

The next-token prediction (NTP) loss is formulated as

$$
\mathcal{L}_{i}=-\log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right),
$$

where \\(x_i\\) is the predicted token and the prefix tokens \\(x_1, x_2, \ldots, x_{i-1}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss.
Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$
\mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}},
$$

where \\(N\\) denotes the number of tokens in the training sample on which the loss is computed.
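
A minimal PyTorch sketch of this square-root re-weighting (an illustration, not the InternVL training code), assuming batched labels where unsupervised positions are marked with `-100`:

```python
import torch
import torch.nn.functional as F

def reweighted_ntp_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100):
    """NTP loss with square-root averaging: token losses in a sample with N supervised
    tokens are weighted by w = 1 / N**0.5, then normalized by the total weight."""
    # Per-token cross-entropy, keeping the (batch, seq) layout.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=ignore_index, reduction="none"
    )
    mask = (labels != ignore_index).float()          # supervised (response) tokens only
    n_tokens = mask.sum(dim=1).clamp(min=1)          # N for each sample
    w = n_tokens.rsqrt()                             # w_i = 1 / N^{0.5}
    token_w = w.unsqueeze(1) * mask                  # broadcast each sample's weight to its tokens
    return (token_w * per_token).sum() / token_w.sum()

# Example: batch of 2 samples with different numbers of supervised tokens.
logits = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
labels[0, :6] = -100                                 # only 2 supervised tokens in sample 0
print(reweighted_ntp_loss(logits, labels))
```

Note that the normalizer \\(\sum_j w_j\\) runs over all supervised tokens in the batch, so each sample contributes in proportion to \\(\sqrt{N}\\) rather than \\(N\\), which is what dampens the bias toward long responses.
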
### Supervised Fine-Tuning

During the SFT phase, we adopt the same objective as in the pre-training stage and use the same square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.
Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources:

(1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks.

(2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. A toy sketch of the answer-filtering step is shown after this list.

(3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation.
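
A minimal sketch of the answer-filtering step (illustrative only: it assumes final answers appear in a `\boxed{...}` span and uses exact string matching, which may differ from the actual pipeline):

```python
import re

def final_answer(rollout: str):
    """Extract the last \\boxed{...} span from a reasoning rollout, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", rollout)
    return matches[-1].strip() if matches else None

def filter_rollouts(samples):
    """Keep only rollouts whose final answer matches the ground truth."""
    kept = []
    for question, ground_truth, rollouts in samples:
        for rollout in rollouts:
            if final_answer(rollout) == ground_truth:
                kept.append({"question": question, "response": rollout})
    return kept

# Toy example standing in for InternVL3-78B captions + DeepSeek-R1 rollouts.
samples = [("What is 2+3?", "5", ["... so the answer is \\boxed{5}", "... \\boxed{6}"])]
print(filter_rollouts(samples))
```
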

### Cascade Reinforcement Learning

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner.
Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage.
Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost.
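
For intuition, here is a self-contained toy of the two-stage cascade: stage 1 warms the policy up with a DPO-style preference loss on fixed (chosen, rejected) pairs standing in for offline RL, and stage 2 refines it with a simple REINFORCE update on self-generated rollouts standing in for online RL. The real recipe trains an MLLM with different objectives; only the cascade structure carries over.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.zeros(4, requires_grad=True)          # toy policy over 4 candidate answers
opt = torch.optim.Adam([logits], lr=0.1)
reward = torch.tensor([0.1, 0.2, 0.9, 0.3])          # verifier scores per answer

# Stage 1: offline warm-up on fixed preference pairs (chosen, rejected).
pairs = [(2, 0), (2, 1), (3, 0)]
for _ in range(100):
    logp = torch.log_softmax(logits, dim=-1)
    loss = -sum(F.logsigmoid(logp[c] - logp[r]) for c, r in pairs) / len(pairs)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: online refinement on rollouts sampled from the current policy.
for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                            # rollout from the current policy
    advantage = reward[action] - reward.mean()        # reward with a simple baseline
    loss = -advantage * dist.log_prob(action)
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=-1))                  # most mass should sit on the best answer
```
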

For the visual resolution router used in InternVL3.5-Flash, \\(y_i^{\text{router}}=0\\) and \\(y_i^{\text{router}}=1\\) indicate that the compression rate \\(\xi\\) is set to \\(\tfrac{1}{16}\\) and \\(\tfrac{1}{4}\\), respectively.

> Please see [our paper](https://huggingface.co/papers/2508.18265) for more technical and experimental details.

### Decoupled Vision-Language Deployment

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history state. In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency.
When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

![dvd/dvd.jpg](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3.5/dvd.jpg)
Multi-image conversation with the LMDeploy pipeline:

```python
images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```

## Acknowledgement

InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!

______________________________________________________________________

Scan the following QR code to join our WeChat group.

<p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>