---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- moe
- fp8
---

# GLM-4.5-FP8

[📚 Paper](https://huggingface.co/papers/2508.06471) | [💻 Code](https://github.com/zai-org/GLM-4.5) | [🌐 Project Page](https://z.ai/blog/glm-4.5)

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" width="15%"/>
</div>
<p align="center">
👋 Join our <a href="https://discord.gg/QR7SARHRxK" target="_blank">Discord</a> community.
<br>
📖 Check out the GLM-4.5 <a href="https://z.ai/blog/glm-4.5" target="_blank">technical blog</a>.
<br>
📍 Use the GLM-4.5 API on the <a href="https://docs.z.ai/guides/llm/glm-4.5">Z.ai API Platform (Global)</a> or the <br> <a href="https://docs.bigmodel.cn/cn/guide/models/text/glm-4.5">Zhipu AI Open Platform (Mainland China)</a>.
<br>
👉 Try <a href="https://chat.z.ai">GLM-4.5</a> with one click.
</p>

## Paper Abstract

We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With far fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.

## Model Introduction

The **GLM-4.5** series models are foundation models designed for intelligent agents. GLM-4.5 has **355** billion total parameters with **32** billion active parameters, while GLM-4.5-Air adopts a more compact design with **106** billion total parameters and **12** billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: a thinking mode for complex reasoning and tool usage, and a non-thinking mode for immediate responses.

We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT license and can be used commercially and for secondary development.

In our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 scores **63.2**, placing **3rd** among all evaluated proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at **59.8** while maintaining superior efficiency.

![bench](https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/bench.png)

For more evaluation results, showcases, and technical details, please visit our [technical blog](https://z.ai/blog/glm-4.5) or refer to the [technical report (paper)](https://huggingface.co/papers/2508.06471).

The model code, tool parser, and reasoning parser can be found in the implementations of [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4_moe), [vLLM](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4_moe_mtp.py), and [SGLang](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/glm4_moe.py).

## Model Downloads

You can try the model directly on [Hugging Face](https://huggingface.co/spaces/zai-org/GLM-4.5-Space) or [ModelScope](https://modelscope.cn/studios/ZhipuAI/GLM-4.5-Demo), or download it via the links below.

| Model            | Download Links | Model Size | Precision |
|------------------|----------------|------------|-----------|
| GLM-4.5          | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5) | 355B-A32B | BF16 |
| GLM-4.5-Air      | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air) | 106B-A12B | BF16 |
| GLM-4.5-FP8      | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-FP8)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-FP8) | 355B-A32B | FP8 |
| GLM-4.5-Air-FP8  | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-FP8)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-FP8) | 106B-A12B | FP8 |
| GLM-4.5-Base     | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Base)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Base) | 355B-A32B | BF16 |
| GLM-4.5-Air-Base | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-4.5-Air-Base)<br> [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-4.5-Air-Base) | 106B-A12B | BF16 |
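
If you prefer to fetch the weights programmatically, `huggingface_hub` works as well. A minimal sketch; the `local_dir` path is illustrative:

```python
from huggingface_hub import snapshot_download

# Download all files of the GLM-4.5-FP8 repository into a local directory.
# `local_dir` is an illustrative path; point it anywhere with enough disk space.
snapshot_download(
    repo_id="zai-org/GLM-4.5-FP8",
    local_dir="./GLM-4.5-FP8",
)
```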

## System Requirements

### Inference

We provide minimum and recommended configurations for "full-featured" model inference. The data in the tables below is based on the following conditions:

1. All models use MTP layers and specify `--speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4` to ensure competitive inference speed.
2. The `cpu-offload` parameter is not used.
3. The inference batch size does not exceed `8`.
4. All runs are executed on devices that natively support FP8 inference, ensuring both weights and cache are in FP8 format.
5. Server memory must exceed `1 TB` to ensure normal model loading and operation.

The models can run under the configurations in the table below:

| Model       | Precision | GPU Type and Count   | Test Framework |
|-------------|-----------|----------------------|----------------|
| GLM-4.5     | BF16      | H100 x 16 / H200 x 8 | sglang         |
| GLM-4.5     | FP8       | H100 x 8 / H200 x 4  | sglang         |
| GLM-4.5-Air | BF16      | H100 x 4 / H200 x 2  | sglang         |
| GLM-4.5-Air | FP8       | H100 x 2 / H200 x 1  | sglang         |

Under the configurations in the table below, the models can utilize their full 128K context length:

| Model       | Precision | GPU Type and Count    | Test Framework |
|-------------|-----------|-----------------------|----------------|
| GLM-4.5     | BF16      | H100 x 32 / H200 x 16 | sglang         |
| GLM-4.5     | FP8       | H100 x 16 / H200 x 8  | sglang         |
| GLM-4.5-Air | BF16      | H100 x 8 / H200 x 4   | sglang         |
| GLM-4.5-Air | FP8       | H100 x 4 / H200 x 2   | sglang         |

### Fine-tuning

The code can run under the configurations in the table below using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory):

| Model       | GPU Type and Count | Strategy | Batch Size (per GPU) |
|-------------|--------------------|----------|----------------------|
| GLM-4.5     | H100 x 16          | LoRA     | 1                    |
| GLM-4.5-Air | H100 x 4           | LoRA     | 1                    |

The code can run under the configurations in the table below using [Swift](https://github.com/modelscope/ms-swift):

| Model       | GPU Type and Count | Strategy | Batch Size (per GPU) |
|-------------|--------------------|----------|----------------------|
| GLM-4.5     | H20 (96GiB) x 16   | LoRA     | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 4    | LoRA     | 1                    |
| GLM-4.5     | H20 (96GiB) x 128  | SFT      | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 32   | SFT      | 1                    |
| GLM-4.5     | H20 (96GiB) x 128  | RL       | 1                    |
| GLM-4.5-Air | H20 (96GiB) x 32   | RL       | 1                    |

## Quick Start

For more comprehensive details and setup instructions, please refer to our [GitHub page](https://github.com/zai-org/GLM-4.5).

### Transformers Inference

Here is a basic example of running inference with the `transformers` library, demonstrating both thinking and non-thinking modes:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_id = "zai-org/GLM-4.5-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # picks up the checkpoint's dtype/quantization config (FP8 here)
    low_cpu_mem_usage=True,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

messages = [
    {"role": "user", "content": "Hello, how are you?"},
]

# Non-thinking mode (direct response):
# passing `enable_thinking=False` through the chat template disables the thinking block.
# This mode suits straightforward questions that need no complex reasoning or tool usage.
inputs_nothink_text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False, enable_thinking=False
)
input_ids_nothink = tokenizer(inputs_nothink_text, return_tensors="pt").input_ids.to(model.device)
outputs_nothink = model.generate(input_ids_nothink, max_new_tokens=100)
print("Non-thinking mode response:", tokenizer.decode(outputs_nothink[0][len(input_ids_nothink[0]):], skip_special_tokens=True))

# Thinking mode (for complex reasoning or tool usage):
# `enable_thinking=True` is the default, so omitting it also triggers thinking mode.
# This mode lets the model perform multi-step reasoning, break down tasks, and use tools.
inputs_think_text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False, enable_thinking=True
)
input_ids_think = tokenizer(inputs_think_text, return_tensors="pt").input_ids.to(model.device)
outputs_think = model.generate(input_ids_think, max_new_tokens=100)
print("Thinking mode response:", tokenizer.decode(outputs_think[0][len(input_ids_think[0]):], skip_special_tokens=True))
```
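
In thinking mode, the decoded text contains the reasoning trace followed by the final answer. As a rough post-processing sketch, assuming the chat template wraps reasoning in `<think>...</think>` tags (the format the `glm45` reasoning parsers referenced above target), you can split the two parts as below; if the tags are registered as special tokens in a given checkpoint, decode with `skip_special_tokens=False` first:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a thinking-mode completion into (reasoning, answer).

    Assumes reasoning is wrapped in <think>...</think>; if the tags are
    absent (e.g., non-thinking mode), the whole text is treated as the answer.
    """
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", text.strip()
```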

### vLLM

+ Both BF16 and FP8 can be started with the following command:

```shell
vllm serve zai-org/GLM-4.5-Air \
    --tensor-parallel-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.5-air
```
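
Once the server is up, it exposes an OpenAI-compatible API. A minimal request sketch, assuming the command above (served model name `glm-4.5-air`, vLLM's default port 8000):

```python
from openai import OpenAI

# The server started above speaks the OpenAI API; the api_key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.5-air",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```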

If you're using 8x H100 GPUs and encounter insufficient memory when running the GLM-4.5 model, add `--cpu-offload-gb 16` (only applicable to vLLM).

If you encounter FlashInfer issues, set `VLLM_ATTENTION_BACKEND=XFORMERS` as a temporary workaround. You can also specify `TORCH_CUDA_ARCH_LIST='9.0+PTX'` to use FlashInfer (different GPUs require different `TORCH_CUDA_ARCH_LIST` values; check accordingly).

### SGLang

+ BF16

```shell
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.5-Air \
    --tp-size 8 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.7 \
    --served-model-name glm-4.5-air \
    --host 0.0.0.0 \
    --port 8000
```

+ FP8

```shell
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.5-Air-FP8 \
    --tp-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.7 \
    --disable-shared-experts-fusion \
    --served-model-name glm-4.5-air-fp8 \
    --host 0.0.0.0 \
    --port 8000
```
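
Either launch command exposes the same OpenAI-compatible endpoint on `0.0.0.0:8000`. A quick smoke test, using the host and port from the commands above:

```python
import requests

# List the models the freshly launched server is serving; a successful
# response confirms the OpenAI-compatible endpoint is up.
print(requests.get("http://localhost:8000/v1/models").json())
```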

### Request Parameter Instructions

+ When using `vLLM` and `SGLang`, thinking mode is enabled by default when sending requests. To disable thinking, add the `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` parameter.
+ Both support tool calling. Please use the OpenAI-style tool description format for calls, as in the sketch below.
+ For specific code, please refer to `api_request.py` in the `inference` folder.
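
A combined sketch of both parameters against either server above; the `get_weather` tool and its schema are hypothetical, used only to illustrate the OpenAI-style format:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool, described in the OpenAI tool-description format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative name, not part of the model release
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.5-air",  # must match --served-model-name
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    # Thinking mode (the default) is recommended for tool usage; to disable
    # it for a request, uncomment the following line:
    # extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.tool_calls)
```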

## Citation

If you find our work useful or helpful for your research and development, please feel free to cite our paper:

```bibtex
@article{zhu2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={Zhu, Xiaohan and Sun, Tianxiang and Wang, Hao and Xu, Yi and Zhang, Yichen and Wang, Junyi and Huang, Junjie and Zeng, Jiao and Huang, Yangyang and Gu, Ruipeng and Zhang, Xiaodong and Du, Mengying and Han, Hao and Li, Chao and Xiao, Jin and Guo, Weidong and Li, Zhen and Lu, Jingkang and Chen, Shu and Chen, Huadong and Chen, Peng and Liu, Hongguang and Guo, Guang and Liu, Wen and Yang, Tianyu and Hu, Bo and Zhang, Wenmin and Sun, Maosong},
  journal={arXiv preprint arXiv:2508.06471},
  year={2025}
}
```