feihu.hf committed
Commit 1c8249c · 1 Parent(s): d199b9e
update README

README.md CHANGED

@@ -201,10 +201,15 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 ### How to Enable 1M Token Context
 
+> [!NOTE]
+> To effectively process a 1 million token context, users will require approximately **1000 GB** of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
+
 #### Step 1: Update Configuration File
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
+#### Step 2: Start Model Server
+
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
 
 #### Option 1: Using vLLM
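For Step 1 above, the swap is just a file copy inside the downloaded model directory. A minimal sketch, assuming the checkpoint was fetched to a local folder named `Qwen3-235B-A22B-Instruct-2507`; the path and the backup filename are illustrative, not part of the commit:

```bash
# Illustrative only: paths assume a local snapshot of the model repository.
cd Qwen3-235B-A22B-Instruct-2507

# Keep the original configuration around, then promote the 1M-token config.
cp config.json config.json.bak
cp config_1m.json config.json
```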
@@ -212,7 +217,9 @@ After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
 To run Qwen with 1M context support, install vLLM from source:
 
 ```bash
-
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+pip install -e .
 ```
 
 Then launch the server with Dual Chunk Flash Attention enabled:
@@ -225,7 +232,8 @@ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
   --enable-chunked-prefill \
   --max-num-batched-tokens 131072 \
   --enforce-eager \
-  --max-num-seqs 1
+  --max-num-seqs 1 \
+  --gpu-memory-utilization 0.85
 ```
 
 ##### Key Parameters
@@ -238,28 +246,14 @@ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
 | `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
 | `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
 | `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
-
-##### Troubleshooting:
-
-1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
-
-   The VRAM reserved for the KV cache is insufficient. Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
-
-2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
-
-   The VRAM reserved for activation weights is insufficient. You can try setting ``gpu_memory_utilization`` to 0.85 or lower, but be aware that this might reduce the VRAM available for the KV cache.
-
-3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
-
-   The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len``.
-
+| `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used for the model executor |
 
 #### Option 2: Using SGLang
 
 First, clone and install the specialized branch:
 
 ```bash
-git clone
+git clone https://github.com/sgl-project/sglang.git
 cd sglang
 pip install -e "python[all]"
 ```
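After cloning and installing either framework, a quick check that pip resolves the source installs can save a failed launch. This check is an editorial suggestion, not part of the README:

```bash
# Verify the editable installs are the ones on the path
# (Location should point at the cloned checkouts).
python3 -m pip show vllm sglang
```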
@@ -282,10 +276,26 @@ python3 -m sglang.launch_server \
 |---------|--------|
 | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
 | `--context-length 1010000` | Defines max input length |
-| `--mem-frac 0.75` |
+| `--mem-frac 0.75` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
 | `--tp 8` | Tensor parallelism size (matches model sharding) |
 | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
 
+#### Troubleshooting:
+
+1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
+
+   The VRAM reserved for the KV cache is insufficient.
+   - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
+   - SGLang: Consider reducing the ``context-length`` or increasing the ``tp``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
+
+2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
+
+   The VRAM reserved for activation weights is insufficient. You can try lowering ``gpu_memory_utilization`` or ``mem-frac``, but be aware that this might reduce the VRAM available for the KV cache.
+
+3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
+
+   The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len`` or ``context-length``.
+
 #### Long-Context Performance
 
 We test the model on a 1M-token version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.
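Once either server is up, the model is reachable through an OpenAI-compatible HTTP API. A minimal smoke test, assuming vLLM's default port 8000 on localhost (SGLang defaults to port 30000); the ports, prompt, and token limit are assumptions, not part of the commit:

```bash
# Minimal request against the OpenAI-compatible endpoint exposed by `vllm serve`.
# Use port 30000 instead if the model was launched with sglang.launch_server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
        "messages": [
          {"role": "user", "content": "Summarize the key points of the attached report."}
        ],
        "max_tokens": 512
      }'
```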
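If the first or second troubleshooting error appears, the memory-related flags above are the knobs to turn. A rough sketch of a more conservative vLLM relaunch, with illustrative values only; the unchanged flags from the full command in the README, which this diff view elides, still apply:

```bash
# Illustrative values, not from the commit: trade context length and prefill batch
# size for headroom when the KV cache or activations do not fit.
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 600000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 65536 \
  --enforce-eager \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.8
```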