Add more descriptive tags to model card
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,18 +1,24 @@
---
-license: apache-2.0
-pipeline_tag: image-text-to-text
-library_name: transformers
base_model:
+- OpenGVLab/InternVL3_5-4B-Pretrained
-base_model_relation: finetune
datasets:
+- OpenGVLab/MMPR-v1.2
+- OpenGVLab/MMPR-Tiny
language:
+- multilingual
+library_name: transformers
+license: apache-2.0
+pipeline_tag: image-text-to-text
tags:
+- internvl
+- custom_code
+- multimodal
+- vision-language-model
+- reasoning
+- agentic
+- multilingual
+- qwen
+base_model_relation: finetune
---

# InternVL3_5-4B-Instruct
@@ -27,7 +33,7 @@ tags:

## Introduction

+We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our *Decoupled Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/619507e7b74b6c591f794340/sM5f0qzY0B3nDCsrfRjXm.png)
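To picture the DvD idea in the paragraph above, the toy two-GPU split below may help. It is only a sketch of the general pattern, not the released DvD implementation: the `vision_encoder` and `language_model` modules are placeholder stand-ins, and it assumes two visible CUDA devices.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a patchifying "vision encoder" on GPU 0 and a small "LLM head" on GPU 1.
vision_encoder = nn.Sequential(nn.Conv2d(3, 64, kernel_size=14, stride=14), nn.Flatten(2)).to('cuda:0')
language_model = nn.Sequential(nn.Linear(64, 4096), nn.GELU(), nn.Linear(4096, 32000)).to('cuda:1')

image = torch.randn(1, 3, 448, 448, device='cuda:0')   # one 448x448 tile

visual_tokens = vision_encoder(image).transpose(1, 2)  # (1, 1024, 64), computed on cuda:0
# Only the compact visual tokens cross the device boundary; the language model runs on cuda:1.
logits = language_model(visual_tokens.to('cuda:1'))
print(logits.shape)                                     # torch.Size([1, 1024, 32000])
```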
@@ -141,7 +147,7 @@ Compared to InternVL3.5, InternVL3.5-Flash further integrates the *Visual Resolution Router (ViR)*
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
+Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50% while maintaining nearly 100% of the performance of InternVL3.5.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/619507e7b74b6c591f794340/aVxW5eaqmFMcE1wSrDiYq.png)
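To make the token arithmetic above concrete, here is a minimal sketch of a pixel-shuffle-style merge. It illustrates the mechanism only and is not the model's actual module: grouping s x s neighboring tokens turns the 1024 tokens of a patch into 256 at a 1/4 ratio or 64 at a 1/16 ratio, with the channel dimension absorbing the difference.

```python
import torch

def pixel_shuffle_compress(x: torch.Tensor, ratio: float) -> torch.Tensor:
    """Merge neighboring visual tokens, reducing their count by `ratio`.

    x: (batch, h*w, c) visual tokens on a square grid, e.g. 32*32 = 1024 tokens.
    ratio: 1/4 keeps 256 tokens, 1/16 keeps 64 tokens; channels grow by 1/ratio.
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)                      # 1024 tokens -> 32 x 32 grid
    s = int((1 / ratio) ** 0.5)                # 2 for ratio 1/4, 4 for ratio 1/16
    x = x.view(b, h, w, c)
    x = x.view(b, h // s, s, w // s, s, c)     # group s x s neighborhoods
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // s) * (w // s), s * s * c)
    return x                                   # (batch, n * ratio, c / ratio)

tokens = torch.randn(1, 1024, 1024)                # 1024 tokens, hidden size 1024
print(pixel_shuffle_compress(tokens, 1 / 4).shape)   # torch.Size([1, 256, 4096])
print(pixel_shuffle_compress(tokens, 1 / 16).shape)  # torch.Size([1, 64, 16384])
```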
@@ -233,7 +239,7 @@ $$
\Bigg],
$$

+where \\(\mathrm{KL}\\) denotes the KL divergence and \\(\xi\\) denotes the compression rate, which is uniformly sampled from \\(\{\frac{1}{4},\frac{1}{16}\}\\). The image \\(I_\xi\\) is represented as 256 tokens when \\(\xi=\frac{1}{4}\\) and 64 tokens when \\(\xi=\frac{1}{16}\\). Notably, the reference model always performs inference with \\(\xi=\frac{1}{4}\\).

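As a rough illustration of this consistency objective (a sketch under assumptions, not the released training code: `model` and `ref_model` are placeholder callables that accept a `compression` argument), the sampled rate picks the token budget and the KL is taken against a frozen reference that always runs at the 1/4 rate:

```python
import random
import torch
import torch.nn.functional as F

def vir_consistency_loss(model, ref_model, image, text_ids):
    """One illustrative step of the consistency objective described above."""
    xi = random.choice([1 / 4, 1 / 16])      # 1/4 -> 256 visual tokens, 1/16 -> 64 visual tokens
    logits = model(image, text_ids, compression=xi)                  # trainable model
    with torch.no_grad():
        ref_logits = ref_model(image, text_ids, compression=1 / 4)   # frozen reference at xi = 1/4
    # KL between the reference outputs and the compressed-input outputs, averaged over the batch
    return F.kl_div(F.log_softmax(logits, dim=-1),
                    F.log_softmax(ref_logits, dim=-1),
                    log_target=True, reduction="batchmean")
```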
`Router training`:
@@ -529,40 +535,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
+question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
+print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
+question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

+question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -570,17 +586,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

+question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -588,13 +608,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

+questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
+    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -632,17 +654,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
+video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
+# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
+print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
+print(f'User: {question}\nAssistant: {response}')
```

#### Streaming Output
@@ -726,7 +755,9 @@ image_urls=[

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
+response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```
@@ -829,3 +860,14 @@ If you find this project useful in your research, please consider citing:
  year={2025}
}
```
+
+
+## Acknowledgement
+
+InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+
+______________________________________________________________________
+
+Scan the following QR code to join our WeChat group.
+
+<p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>