nielsr HF Staff committed on
Commit 411d92d · verified · 1 Parent(s): 52cbe46

Update paper link and enrich model card content


This PR updates the primary paper link to [Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models](https://huggingface.co/papers/2507.12566) and enhances the model card content with a detailed introduction, key highlights, performance overview, and a brief usage example, drawn from the paper abstract and the project's GitHub README.

Files changed (1)
README.md +85 -16
README.md CHANGED
@@ -1,38 +1,107 @@
  ---
- license: mit
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - internlm/internlm2-chat-1_8b
- base_model_relation: merge
  language:
- - multilingual
  tags:
- - internvl
- - vision
- - ocr
- - custom_code
- - moe
  ---

  # Mono-InternVL-2B-S1-1

- This repository contains the Mono-InternVL-2B model after **S1.1 concept learning**.

- Please refer to our [**paper**](https://huggingface.co/papers/2410.08202), [**project page**](https://internvl.github.io/blog/2024-10-10-Mono-InternVL/) and [**GitHub repository**](https://github.com/OpenGVLab/mono-internvl) for introduction and usage.

  ## Citation

- If you find this project useful in your research, please consider citing:

  ```BibTeX
- @article{luo2024mono,
  title={Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training},
  author={Luo, Gen and Yang, Xue and Dou, Wenhan and Wang, Zhaokai and Liu, Jiawen and Dai, Jifeng and Qiao, Yu and Zhu, Xizhou},
  journal={arXiv preprint arXiv:2410.08202},
  year={2024}
  }
- ```
  ---
  base_model:
+ - internlm/internlm2-chat-1_8b
  language:
+ - multilingual
+ library_name: transformers
+ license: mit
+ pipeline_tag: image-text-to-text
  tags:
+ - internvl
+ - vision
+ - ocr
+ - custom_code
+ - moe
+ base_model_relation: merge
  ---

  # Mono-InternVL-2B-S1-1

+ This repository contains the Mono-InternVL-2B model after **S1.1 concept learning**, as part of the work presented in [Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models](https://huggingface.co/papers/2507.12566).
+
+ Please refer to our [**project page**](https://internvl.github.io/blog/2024-10-10-Mono-InternVL/) and [**GitHub repository**](https://github.com/OpenGVLab/mono-internvl) for a full introduction, code, and usage instructions.
+
+ **Mono-InternVL** is a family of monolithic multimodal large language models (MLLMs) that integrate visual encoding and language decoding into a single LLM, aiming for cheaper and faster inference. It addresses the challenges of unstable optimization and catastrophic forgetting by embedding a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge via delta tuning.
+
+ ### ✨ Key Highlights
+
+ - **Monolithic Architecture**: Integrates visual encoding and language decoding into a single LLM, simplifying the model structure (a conceptual sketch follows this list).
+ - **Endogenous Visual Pre-training (EViP++)**: An improved pre-training strategy that builds up visual capabilities through progressive learning and adds extra visual attention experts.
+ - **Efficiency**: Significantly reduces training and inference costs while maintaining competitive performance, aided by a fused CUDA kernel for faster MoE operations.
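+
+ The routing idea behind this monolithic design can be pictured with a minimal sketch (an illustrative simplification, not the released implementation): visual tokens are sent through newly added visual experts, while text tokens keep using the frozen, pre-trained FFN, which is what allows the visual parameter space to be learned via delta tuning without disturbing the language abilities.
+
+ ```python
+ # Conceptual sketch only: modality-routed FFN experts. Hypothetical class name;
+ # the released model uses its own layer layout plus a fused CUDA kernel.
+ import torch
+ import torch.nn as nn
+
+ class ModalityRoutedMLP(nn.Module):
+     def __init__(self, hidden_size: int, intermediate_size: int):
+         super().__init__()
+         # Pre-trained language FFN: kept frozen (delta tuning).
+         self.text_ffn = nn.Sequential(
+             nn.Linear(hidden_size, intermediate_size), nn.GELU(),
+             nn.Linear(intermediate_size, hidden_size))
+         for p in self.text_ffn.parameters():
+             p.requires_grad = False
+         # Newly embedded visual expert: the trainable visual parameter space.
+         self.visual_ffn = nn.Sequential(
+             nn.Linear(hidden_size, intermediate_size), nn.GELU(),
+             nn.Linear(intermediate_size, hidden_size))
+
+     def forward(self, hidden_states: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
+         # hidden_states: (batch, seq, hidden); is_visual: (batch, seq) boolean mask.
+         text_out = self.text_ffn(hidden_states)
+         visual_out = self.visual_ffn(hidden_states)
+         return torch.where(is_visual.unsqueeze(-1), visual_out, text_out)
+ ```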
+
+ ### 📊 Performance
+
+ Mono-InternVL achieves competitive results across a wide range of multimodal benchmarks, often outperforming other monolithic MLLMs. Compared to its modular counterpart InternVL-1.5, Mono-InternVL-1.5 delivers similar multimodal performance while reducing first-token latency by up to 69% (a rough way to probe this latency yourself is sketched below the table).
+
+ Below is a summary of results on some key benchmarks:
+
+ | Benchmark | Mono-InternVL-2B | Mini-InternVL-2B-1-5 | Emu3 |
+ | :------------------- | :--------------: | :------------------: | :--------: |
+ | Type | Monolithic | Modular | Monolithic |
+ | #Activated Params | 1.8B | 2.2B | 8B |
+ | **MMVet** | 40.1 | 39.3 | 37.2 |
+ | **OCRBench** | 767 | 654 | 687 |
+ | **MathVista** | 45.7 | 41.1 | — |
+ | **TextVQA** | 72.6 | 70.5 | 64.7 |
+ | **DocVQA** | 80.0 | 85.0 | 76.3 |
+
+ *(For full performance details, please refer to the [paper](https://huggingface.co/papers/2507.12566) and [project page](https://internvl.github.io/blog/2024-10-10-Mono-InternVL/).)*
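+
+ As a sanity check on your own hardware, first-token latency can be approximated by timing a single-token generation with `model.chat` (an approximation, not the paper's measurement protocol; assumes `model`, `tokenizer`, and `pixel_values` are prepared as in the Quick Inference example below):
+
+ ```python
+ # Rough first-token latency probe (hypothetical helper, not from the repo).
+ import time
+ import torch
+
+ def first_token_latency(model, tokenizer, pixel_values, question, n_runs=5):
+     cfg = dict(max_new_tokens=1, do_sample=False)  # one token ≈ time-to-first-token
+     model.chat(tokenizer, pixel_values, question, cfg)  # warm-up run
+     torch.cuda.synchronize()
+     timings = []
+     for _ in range(n_runs):
+         start = time.perf_counter()
+         model.chat(tokenizer, pixel_values, question, cfg)
+         torch.cuda.synchronize()
+         timings.append(time.perf_counter() - start)
+     return sum(timings) / len(timings)
+ ```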
+
+ ### 🚀 Quick Inference (using Transformers)
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load model and tokenizer (ensure transformers==4.37.2)
+ path = 'OpenGVLab/Mono-InternVL-2B'
+ model = AutoModel.from_pretrained(
+     path,
+     torch_dtype=torch.bfloat16,
+     low_cpu_mem_usage=True,
+     trust_remote_code=True
+ ).eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
+
+ # Load and preprocess an image as described in the GitHub instructions.
+ # Refer to the GitHub repo for the `load_image` utility function
+ # (a minimal stand-in is sketched right after this block).
+ # pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
+ pixel_values = None  # Replace with an actual image tensor
+
+ generation_config = dict(max_new_tokens=1024, do_sample=True)
+
+ # Example: single-image, single-round conversation
+ question = '<image>\nPlease describe the image shortly.'
+ # response = model.chat(tokenizer, pixel_values, question, generation_config)
+ # print(f'User: {question}\nAssistant: {response}')
+
+ # Example: pure-text conversation
+ question = 'Hello, who are you?'
+ response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')
+ ```
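+
+ The `load_image` helper referenced above lives in the GitHub repository. As a minimal stand-in (an assumption: a single 448×448 tile with ImageNet normalization, without the repo's dynamic tiling), `pixel_values` can be built like this:
+
+ ```python
+ # Minimal substitute for the repo's `load_image` utility (single-tile only).
+ import torch
+ from PIL import Image
+ import torchvision.transforms as T
+
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_STD = (0.229, 0.224, 0.225)
+
+ def load_image_simple(path: str, input_size: int = 448) -> torch.Tensor:
+     transform = T.Compose([
+         T.Resize((input_size, input_size), interpolation=T.InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
+     ])
+     image = Image.open(path).convert('RGB')
+     return transform(image).unsqueeze(0)  # shape: (1, 3, input_size, input_size)
+
+ # pixel_values = load_image_simple('./examples/image1.jpg').to(torch.bfloat16).cuda()
+ ```
+
+ For multi-tile, multi-image, or video inputs, use the official helper from the GitHub repository instead.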

  ## Citation

+ If you find this project useful in your research, please consider citing the related papers:

  ```BibTeX
+ @article{mono_internvl_v1,
  title={Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training},
  author={Luo, Gen and Yang, Xue and Dou, Wenhan and Wang, Zhaokai and Liu, Jiawen and Dai, Jifeng and Qiao, Yu and Zhu, Xizhou},
  journal={arXiv preprint arXiv:2410.08202},
  year={2024}
  }
+
+ @article{mono_internvl_v1.5,
+ title={Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models},
+ author={Luo, Gen and Dou, Wenhan and Li, Wenhao and Wang, Zhaokai and Yang, Xue and Tian, Changyao and Li, Hao and Wang, Weiyun and Wang, Wenhai and Zhu, Xizhou and Qiao, Yu and Dai, Jifeng},
+ journal={arXiv preprint arXiv:2507.12566},
+ year={2025}
+ }
+ ```