feihu.hf committed
Commit d199b9e · 1 Parent(s): 9fe72e5

update 1m support
Files changed (3)
  1. README.md +120 -1
  2. config_1m.json +0 -0
  3. tokenizer_config.json +1 -1
README.md CHANGED
@@ -33,7 +33,7 @@ We introduce the updated version of the **Qwen3-235B-A22B non-thinking mode**, n
33
  - Number of Attention Heads (GQA): 64 for Q and 4 for KV
34
  - Number of Experts: 128
35
  - Number of Activated Experts: 8
36
- - Context Length: **262,144 natively**.
37
 
38
  **NOTE: This model supports only non-thinking mode and does not generate ``<think></think>`` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.**
39
 
@@ -188,6 +188,118 @@ for responses in bot.run(messages=messages):
188
  print(responses)
189
  ```
190
 
191
  ## Best Practices
192
 
193
  To achieve optimal performance, we recommend the following settings:
@@ -216,4 +328,11 @@ If you find our work helpful, feel free to give us a cite.
216
  primaryClass={cs.CL},
217
  url={https://arxiv.org/abs/2505.09388},
218
  }
219
  ```
 
33
  - Number of Attention Heads (GQA): 64 for Q and 4 for KV
34
  - Number of Experts: 128
35
  - Number of Activated Experts: 8
36
+ - Context Length: **262,144 natively and extendable up to 1,010,000 tokens**
37
 
38
  **NOTE: This model supports only non-thinking mode and does not generate ``<think></think>`` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.**
39
 
 
188
  print(responses)
189
  ```
190
 
191
+ ## Processing Ultra-Long Texts
192
+
193
+ To support **ultra-long context processing** (up to **1 million tokens**), we integrate two key techniques:
194
+
195
+ - **[Dual Chunk Attention](https://arxiv.org/abs/2402.17463) (DCA)**: A length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
196
+ - **[MInference](https://arxiv.org/abs/2407.02490)**: A sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.
197
+
198
+ Together, these innovations significantly improve both **generation quality** and **inference efficiency** for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a **3× speedup** compared to standard attention implementations.
199
+
200
+ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.org/abs/2501.15383).
201
+
202
+ ### How to Enable 1M Token Context
203
+
204
+ #### Step 1: Update Configuration File
205
+
206
+ Replace the content of your `config.json` with that of `config_1m.json`, which includes the configuration for length extrapolation and sparse attention.
207
+
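+ For example, here is a minimal sketch assuming you serve the model from a local copy of the checkpoint (the local directory name is illustrative):
+
+ ```bash
+ # Download the checkpoint to a local directory, then overwrite config.json
+ # with the 1M-context configuration shipped in this repository.
+ huggingface-cli download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir ./Qwen3-235B-A22B-Instruct-2507
+ cd ./Qwen3-235B-A22B-Instruct-2507
+ cp config_1m.json config.json
+ ```
+
+ If you go this route, point vLLM or SGLang at the local directory instead of the Hub model ID.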
208
+ After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
209
+
210
+ #### Option 1: Using vLLM
211
+
212
+ To serve Qwen with 1M-token context support, first install vLLM 0.10.0 or later:
213
+
214
+ ```bash
215
+ pip install "vllm>=0.10.0"
216
+ ```
217
+
218
+ Then launch the server with Dual Chunk Flash Attention enabled:
219
+
220
+ ```bash
221
+ VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
222
+ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
223
+ --tensor-parallel-size 8 \
224
+ --max-model-len 1010000 \
225
+ --enable-chunked-prefill \
226
+ --max-num-batched-tokens 131072 \
227
+ --enforce-eager \
228
+ --max-num-seqs 1
229
+ ```
230
+
231
+ ##### Key Parameters
232
+
233
+ | Parameter | Purpose |
234
+ |--------|--------|
235
+ | `VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN` | Enables the custom attention kernel for long-context efficiency |
236
+ | `--max-model-len 1010000` | Sets maximum context length to ~1M tokens |
237
+ | `--enable-chunked-prefill` | Allows chunked prefill for very long inputs (avoids OOM) |
238
+ | `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
239
+ | `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
240
+ | `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
241
+
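+ Once the server is up, you can send requests through vLLM's OpenAI-compatible API, which listens on port 8000 by default. A minimal sketch (the prompt and `max_tokens` value are assumptions):
+
+ ```bash
+ # Query the OpenAI-compatible chat completions endpoint served by vLLM.
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
+     "messages": [
+       {"role": "user", "content": "Summarize the following document: <your long document here>"}
+     ],
+     "max_tokens": 1024
+   }'
+ ```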
242
+ ##### Troubleshooting
243
+
244
+ 1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
245
+
246
+ The VRAM reserved for the KV cache is insufficient. Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
247
+
248
+ 2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
249
+
250
+ The VRAM reserved for activation weights is insufficient. You can try setting ``gpu_memory_utilization`` to 0.85 or lower, but be aware that this might reduce the VRAM available for the KV cache. An adjusted launch command is sketched after this list.
251
+
252
+ 3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
253
+
254
+ The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len``.
255
+
256
+
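+ For instance, a launch command adjusted for the out-of-memory case above might look like the following (0.85 is an illustrative value; tune it for your hardware):
+
+ ```bash
+ # Same launch as above, with GPU memory utilization lowered to leave more
+ # room for activations (this may shrink the memory available for the KV cache).
+ VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
+ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
+   --tensor-parallel-size 8 \
+   --max-model-len 1010000 \
+   --enable-chunked-prefill \
+   --max-num-batched-tokens 131072 \
+   --enforce-eager \
+   --max-num-seqs 1 \
+   --gpu-memory-utilization 0.85
+ ```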
257
+ #### Option 2: Using SGLang
258
+
259
+ First, clone and install the specialized branch:
260
+
261
+ ```bash
262
+ git clone -b qwen-1m-dca https://github.com/sgl-project/sglang.git
263
+ cd sglang
264
+ pip install -e "python[all]"
265
+ ```
266
+
267
+ Launch the server with DCA support:
268
+
269
+ ```bash
270
+ python3 -m sglang.launch_server \
271
+ --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
272
+ --context-length 1010000 \
273
+ --mem-frac 0.75 \
274
+ --attention-backend dual_chunk_flash_attn \
275
+ --tp 8 \
276
+ --chunked-prefill-size 131072
277
+ ```
278
+
279
+ ##### Key Parameters
280
+
281
+ | Parameter | Purpose |
282
+ |---------|--------|
283
+ | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
284
+ | `--context-length 1010000` | Defines max input length |
285
+ | `--mem-frac 0.75` | Allocates 75% GPU memory to KV cache (adjust based on hardware) |
286
+ | `--tp 8` | Tensor parallelism size (matches model sharding) |
287
+ | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
288
+
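+ The SGLang server also exposes an OpenAI-compatible API, on port 30000 by default, so requests follow the same format shown in the vLLM section above. A quick sketch to check that the server is up:
+
+ ```bash
+ # List the served model via SGLang's OpenAI-compatible API (default port 30000).
+ curl http://localhost:30000/v1/models
+ ```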
289
+ #### Long-Context Performance
290
+
291
+ We test the model on a 1M-token version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.
292
+
293
+ | Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
294
+ |---------------------------------------------|---------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
295
+ | Qwen3-235B-A22B (Non-Thinking) | 83.9 | 97.7 | 96.1 | 97.5 | 96.1 | 94.2 | 90.3 | 88.5 | 85.0 | 82.1 | 79.2 | 74.4 | 70.0 | 71.0 | 68.5 | 68.0 |
296
+ | Qwen3-235B-A22B-Instruct-2507 (Full Attention) | 92.5 | 98.5 | 97.6 | 96.9 | 97.3 | 95.8 | 94.9 | 93.9 | 94.5 | 91.0 | 92.2 | 90.9 | 87.8 | 84.8 | 86.5 | 84.5 |
297
+ | Qwen3-235B-A22B-Instruct-2507 (Sparse Attention) | 91.7 | 98.5 | 97.2 | 97.3 | 97.7 | 96.6 | 94.6 | 92.8 | 94.3 | 90.5 | 89.7 | 89.5 | 86.4 | 83.6 | 84.2 | 82.5 |
298
+
299
+
300
+ * All models are evaluated with Dual Chunk Attention enabled.
301
+ * Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples for each).
302
+
303
  ## Best Practices
304
 
305
  To achieve optimal performance, we recommend the following settings:
 
328
  primaryClass={cs.CL},
329
  url={https://arxiv.org/abs/2505.09388},
330
  }
331
+
332
+ @article{qwen2.5-1m,
333
+ title={Qwen2.5-1M Technical Report},
334
+ author={An Yang and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoyan Huang and Jiandong Jiang and Jianhong Tu and Jianwei Zhang and Jingren Zhou and Junyang Lin and Kai Dang and Kexin Yang and Le Yu and Mei Li and Minmin Sun and Qin Zhu and Rui Men and Tao He and Weijia Xu and Wenbiao Yin and Wenyuan Yu and Xiafei Qiu and Xingzhang Ren and Xinlong Yang and Yong Li and Zhiying Xu and Zipeng Zhang},
335
+ journal={arXiv preprint arXiv:2501.15383},
336
+ year={2025}
337
+ }
338
  ```
config_1m.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -230,7 +230,7 @@
230
  "clean_up_tokenization_spaces": false,
231
  "eos_token": "<|im_end|>",
232
  "errors": "replace",
233
- "model_max_length": 262144,
234
  "pad_token": "<|endoftext|>",
235
  "split_special_tokens": false,
236
  "tokenizer_class": "Qwen2Tokenizer",
 
230
  "clean_up_tokenization_spaces": false,
231
  "eos_token": "<|im_end|>",
232
  "errors": "replace",
233
+ "model_max_length": 1010000,
234
  "pad_token": "<|endoftext|>",
235
  "split_special_tokens": false,
236
  "tokenizer_class": "Qwen2Tokenizer",