Zenithwang committed · Commit 14c8393 · verified · 1 parent: cda3a8d

Create deploy_guidance.md

Files changed (1): deploy_guidance.md (+210, -0)
deploy_guidance.md ADDED
@@ -0,0 +1,210 @@
# Step3 Model Deployment Guide

This document provides deployment guidance for the Step3 model.

Currently, our open-source deployment guide only covers the TP and DP+TP deployment methods. The AFD (Attn-FFN Disaggregated) approach described in our [paper](https://arxiv.org/abs/2507.19427) is still under joint development with the open-source community to achieve optimal performance. Please stay tuned for updates on our open-source progress.

## Overview

Step3 is a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs.

For our fp8 version, about 326 GB of memory is required.
The smallest deployment unit for this version is 8xH20 with either Tensor Parallel (TP) or Data Parallel + Tensor Parallel (DP+TP).

For our bf16 version, about 642 GB of memory is required.
The smallest deployment unit for this version is 16xH20 with either Tensor Parallel (TP) or Data Parallel + Tensor Parallel (DP+TP).

## Deployment Options

### vLLM Deployment

Please make sure to use a nightly build of vLLM. For details, please refer to the [vLLM nightly installation doc](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-wheels).
```bash
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```
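
To confirm that a nightly build was actually installed, a quick version check (the exact version string will vary) is:

```bash
python3 -c "import vllm; print(vllm.__version__)"
```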

We recommend using the following commands to deploy the model:

**`max_num_batched_tokens` should be larger than 4096. If not set, the default value is 8192.**
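
If you want to set the limit explicitly, it can be passed on the command line. The snippet below is only an illustration of the flag, appended to the BF16 TP launch from the next section; the flag is the CLI form of `max_num_batched_tokens`:

```bash
# Example only: raise the batched-token limit explicitly (must stay above 4096).
vllm serve /path/to/step3 \
    --tensor-parallel-size 16 \
    --max-num-batched-tokens 8192 \
    --trust-remote-code
```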

#### BF16 Model
##### Tensor Parallelism (Serving on 16xH20):

```bash
# start ray on node 0 and node 1

# node 0:
vllm serve /path/to/step3 \
    --tensor-parallel-size 16 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --trust-remote-code \
    --port $PORT_SERVING
```
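
Once the server is up, a minimal reachability check against the OpenAI-compatible endpoint (using the same port as above) looks like this:

```bash
# Lists the models the server exposes; useful to confirm the served model name
# that client requests should pass as "model".
curl http://localhost:$PORT_SERVING/v1/models
```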

##### Data Parallelism + Tensor Parallelism (Serving on 16xH20):
Step3 has only a single KV head, so attention data parallelism can be adopted to reduce KV cache memory usage.

```bash
# start ray on node 0 and node 1

# node 0:
vllm serve /path/to/step3 \
    --data-parallel-size 16 \
    --tensor-parallel-size 1 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --trust-remote-code
```

#### FP8 Model
##### Tensor Parallelism (Serving on 8xH20):

```bash
vllm serve /path/to/step3-fp8 \
    --tensor-parallel-size 8 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code
```

##### Data Parallelism + Tensor Parallelism (Serving on 8xH20):

```bash
vllm serve /path/to/step3-fp8 \
    --data-parallel-size 8 \
    --tensor-parallel-size 1 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --trust-remote-code
```

##### Key parameter notes:

* `reasoning-parser`: If enabled, reasoning content in the response will be parsed into a structured format.
* `tool-call-parser`: If enabled, tool call content in the response will be parsed into a structured format (see the example request below).

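Both behaviors can be observed from a single request. The example below is a sketch against the vLLM server started above; the exact field names in the parsed response (for instance a separate reasoning field next to `content`, and structured `tool_calls`) depend on the vLLM version you are running:

```bash
# Send a simple request and pretty-print the JSON response to inspect the
# parsed reasoning and tool-call fields. Adjust the model name if you set
# --served-model-name.
curl -s http://localhost:$PORT_SERVING/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "step3", "messages": [{"role": "user", "content": "What is 17 * 24?"}]}' \
  | python3 -m json.tool
```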

### SGLang Deployment

SGLang 0.4.10 or later is required.

```bash
pip3 install "sglang[all]>=0.4.10"
```
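
To confirm the installed version meets the requirement:

```bash
pip3 show sglang   # the reported Version should be >= 0.4.10
```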

#### BF16 Model
##### Tensor Parallelism (Serving on 16xH20):

```bash
# start ray on node 0 and node 1

# node 0:
python -m sglang.launch_server \
    --model-path /path/to/step3 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 16
```
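
SGLang also exposes an OpenAI-compatible HTTP API. By default `sglang.launch_server` listens on port 30000 (override with `--port`); assuming that default, a quick reachability check is:

```bash
# Health probe and model listing for the SGLang server.
curl http://localhost:30000/health
curl http://localhost:30000/v1/models
```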

#### FP8 Model
##### Tensor Parallelism (Serving on 8xH20):

```bash
python -m sglang.launch_server \
    --model-path /path/to/step3-fp8 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 8
```

### TensorRT-LLM Deployment

[Coming soon...]

## Client Request Examples

Once a server is running, you can use the chat API as below:
```python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="step3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://xxxxx.png"
                    },
                },
                {"type": "text", "text": "Please describe the image."},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```

You can also upload base64-encoded local images:

```python
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Read the local image and encode it as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_step = f"data:image/png;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="step3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_step
                    },
                },
                {"type": "text", "text": "Please describe the image."},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```

Note: In our image preprocessing pipeline, we implement a multi-patch mechanism to handle large images. If the input image exceeds 728x728 pixels, the system will automatically crop it into multiple patches.
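
If you want to know in advance whether an image will be split into patches, a dimension check is enough. The sketch below (using Pillow, with the same placeholder path as the client example) only inspects whether either side is above the 728-pixel threshold; the actual cropping logic is internal to the preprocessing pipeline:

```bash
python3 - <<'EOF'
from PIL import Image

image_path = "/path/to/local/image.png"  # placeholder path
with Image.open(image_path) as im:
    print(im.size, "exceeds 728x728:", im.width > 728 or im.height > 728)
EOF
```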