Improve model card with sample usage and enhanced tags

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +140 -62
README.md CHANGED
@@ -1,62 +1,140 @@
- ---
- license: mit
- language:
- - en
- - zh
- base_model:
- - THUDM/GLM-4-9B-0414
- pipeline_tag: image-text-to-text
- library_name: transformers
- tags:
- - reasoning
- ---
-
- # GLM-4.1V-9B-Base
-
- <div align="center">
- <img src=https://raw.githubusercontent.com/THUDM/GLM-4.1V-Thinking/99c5eb6563236f0ff43605d91d107544da9863b2/resources/logo.svg width="40%"/>
- </div>
- <p align="center">
- 📖 View the GLM-4.1V-9B-Thinking <a href="https://arxiv.org/abs/2507.01006" target="_blank">paper</a>.
- <br>
- 💡 Try the <a href="https://huggingface.co/spaces/THUDM/GLM-4.1V-9B-Thinking-Demo" target="_blank">Hugging Face</a> or <a href="https://modelscope.cn/studios/ZhipuAI/GLM-4.1V-9B-Thinking-Demo" target="_blank">ModelScope</a> online demo for GLM-4.1V-9B-Thinking.
- <br>
- 📍 Using GLM-4.1V-9B-Thinking API at <a href="https://www.bigmodel.cn/dev/api/visual-reasoning-model/GLM-4.1V-Thinking">Zhipu Foundation Model Open Platform</a>
- </p>
-
- ## Model Introduction
-
- Vision-Language Models (VLMs) have become foundational components of intelligent systems. As real-world AI tasks grow
- increasingly complex, VLMs must evolve beyond basic multimodal perception to enhance their reasoning capabilities in
- complex tasks. This involves improving accuracy, comprehensiveness, and intelligence, enabling applications such as
- complex problem solving, long-context understanding, and multimodal agents.
-
- Based on the [GLM-4-9B-0414](https://github.com/THUDM/GLM-4) foundation model, we present the new open-source VLM model
- **GLM-4.1V-9B-Thinking**, designed to explore the upper limits of reasoning in vision-language models. By introducing
- a "thinking paradigm" and leveraging reinforcement learning, the model significantly enhances its capabilities. It
- achieves state-of-the-art performance among 10B-parameter VLMs, matching or even surpassing the 72B-parameter
- Qwen-2.5-VL-72B on 18 benchmark tasks. We are also open-sourcing the base model GLM-4.1V-9B-Base to
- support further research into the boundaries of VLM capabilities.
-
- ![rl](https://raw.githubusercontent.com/THUDM/GLM-4.1V-Thinking/refs/heads/main/resources/rl.jpeg)
-
- Compared to the previous generation models CogVLM2 and the GLM-4V series, **GLM-4.1V-Thinking** offers the
- following improvements:
-
- 1. The first reasoning-focused model in the series, achieving world-leading performance not only in mathematics but also
- across various sub-domains.
- 2. Supports **64k** context length.
- 3. Handles **arbitrary aspect ratios** and up to **4K** image resolution.
- 4. Provides an open-source version supporting both **Chinese and English bilingual** usage.
-
- ## Benchmark Performance
-
- By incorporating the Chain-of-Thought reasoning paradigm, GLM-4.1V-9B-Thinking significantly improves answer accuracy,
- richness, and interpretability. It comprehensively surpasses traditional non-reasoning visual models.
- Out of 28 benchmark tasks, it achieved the best performance among 10B-level models on 23 tasks,
- and even outperformed the 72B-parameter Qwen-2.5-VL-72B on 18 tasks.
-
- ![bench](https://raw.githubusercontent.com/THUDM/GLM-4.1V-Thinking/refs/heads/main/resources/bench.jpeg)
-
-
- For video reasoning, web demo deployment, and more code, please check our [GitHub](https://github.com/THUDM/GLM-4.1V-Thinking).
+ ---
+ base_model:
+ - THUDM/GLM-4-9B-0414
+ language:
+ - en
+ - zh
+ library_name: transformers
+ license: mit
+ pipeline_tag: image-text-to-text
+ tags:
+ - reasoning
+ - multimodal
+ - video
+ - gui-agent
+ - grounding
+ ---
+
+ # GLM-4.1V-9B-Base
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/THUDM/GLM-4.1V-Thinking/99c5eb6563236f0ff43605d91d107544da9863b2/resources/logo.svg" width="40%"/>
+ </div>
+ <p align="center">
+ 📖 View the GLM-4.1V-9B-Thinking <a href="https://arxiv.org/abs/2507.01006" target="_blank">paper</a>.
+ <br>
+ 💡 Try the <a href="https://huggingface.co/spaces/THUDM/GLM-4.1V-9B-Thinking-Demo" target="_blank">Hugging Face</a> or <a href="https://modelscope.cn/studios/ZhipuAI/GLM-4.1V-9B-Thinking-Demo" target="_blank">ModelScope</a> online demo for GLM-4.1V-9B-Thinking.
+ <br>
+ 📍 Use the GLM-4.1V-9B-Thinking API at the <a href="https://www.bigmodel.cn/dev/api/visual-reasoning-model/GLM-4.1V-Thinking">Zhipu Foundation Model Open Platform</a>.
+ </p>
+
+ ## Model Introduction
+
+ Vision-Language Models (VLMs) have become foundational components of intelligent systems. As real-world AI tasks grow
+ increasingly complex, VLMs must evolve beyond basic multimodal perception toward stronger reasoning on
+ complex tasks, improving accuracy, comprehensiveness, and intelligence, and enabling applications such as
+ complex problem solving, long-context understanding, and multimodal agents.
+
+ Based on the [GLM-4-9B-0414](https://github.com/THUDM/GLM-4) foundation model, we present the new open-source VLM
+ **GLM-4.1V-9B-Thinking**, designed to explore the upper limits of reasoning in vision-language models. By introducing
+ a "thinking paradigm" and leveraging reinforcement learning, the model significantly enhances its capabilities. It
+ achieves state-of-the-art performance among 10B-parameter VLMs, matching or even surpassing the 72B-parameter
+ Qwen-2.5-VL-72B on 18 benchmark tasks. We are also open-sourcing the base model GLM-4.1V-9B-Base to
+ support further research into the boundaries of VLM capabilities.
+
+ ![rl](https://raw.githubusercontent.com/THUDM/GLM-4.1V-Thinking/refs/heads/main/resources/rl.jpeg)
+
+ Compared to the previous-generation CogVLM2 and GLM-4V series models, **GLM-4.1V-Thinking** offers the
+ following improvements:
+
+ 1. The first reasoning-focused model in the series, achieving world-leading performance not only in mathematics but also
+ across various sub-domains.
+ 2. Supports a **64k** context length.
+ 3. Handles **arbitrary aspect ratios** and up to **4K** image resolution.
+ 4. Provides an open-source version supporting both **Chinese and English bilingual** usage.
+
+ ## Benchmark Performance
+
+ By incorporating the Chain-of-Thought reasoning paradigm, GLM-4.1V-9B-Thinking significantly improves answer accuracy,
+ richness, and interpretability, comprehensively surpassing traditional non-reasoning visual models.
+ It achieved the best performance among 10B-level models on 23 of 28 benchmark tasks,
+ and even outperformed the 72B-parameter Qwen-2.5-VL-72B on 18 of them.
+
+ ![bench](https://raw.githubusercontent.com/THUDM/GLM-4.1V-Thinking/refs/heads/main/resources/bench.jpeg)
+
+ For video reasoning, web demo deployment, and more code, please check our [GitHub](https://github.com/THUDM/GLM-4.1V-Thinking).
+
+ ## Sample Usage
+
+ You can use the model with the `transformers` library as follows (a recent `transformers` release with GLM-4.1V support is required):
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+
+ model_id = "THUDM/GLM-4.1V-9B-Base"  # this model
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ # Example image (replace with your own image path, or load one from a URL instead):
+ # import requests
+ # from io import BytesIO
+ # img_url = "https://raw.githubusercontent.com/THUDM/GLM-4.1V-Thinking/refs/heads/main/resources/rl.jpeg"
+ # image = Image.open(BytesIO(requests.get(img_url).content)).convert("RGB")
+ image_path = "./path/to/your/image.jpg"  # replace with your actual image path
+ image = Image.open(image_path).convert("RGB")
+
+ # Example prompt
+ prompt = "Describe this image in detail."
+
+ # Prepare a multimodal chat message; the processor handles both the image and the text
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+
+ # Apply the chat template, tokenizing the text and preprocessing the image together
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Generate a response
+ generated_ids = model.generate(
+     **inputs,
+     max_new_tokens=512,
+     do_sample=True,  # set to False for greedy decoding
+     temperature=0.7,
+     top_p=0.9,
+ )
+
+ # Decode and print only the newly generated tokens
+ generated_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(generated_text)
+ ```
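+
+ The processor's chat template can also resolve images passed by URL, so the manual `PIL` loading step can be skipped. Below is a minimal sketch that reuses the `processor` and `model` loaded above; the example URL simply points at an image from the project repository:
+
+ ```python
+ # Minimal sketch: pass the image by URL and let the processor's chat template fetch it
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "url": "https://raw.githubusercontent.com/THUDM/GLM-4.1V-Thinking/refs/heads/main/resources/rl.jpeg"},
+             {"type": "text", "text": "Describe this image in detail."},
+         ],
+     }
+ ]
+ inputs = processor.apply_chat_template(
+     messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
+ ).to(model.device)
+ generated_ids = model.generate(**inputs, max_new_tokens=512)
+ print(processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```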
+
+ ## Citation
+
+ If you use this model, please cite the following paper:
+
+ ```bibtex
+ @misc{vteam2025glm45vglm41vthinkingversatilemultimodal,
+       title={GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},
+       author={V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Bin Chen and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiale Zhu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingdao Liu and Mingde Xu and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Tianyu Tong and Wenkai Li and Wei Jia and Xiao Liu and Xiaohan Zhang and Xin Lyu and Xinyue Fan and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yanzi Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuting Wang and Yu Wang and Yuxuan Zhang and Zhao Xue and Zhenyu Hou and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang},
+       year={2025},
+       eprint={2507.01006},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2507.01006},
+ }
+ ```