feihu.hf committed
Commit · d199b9e · 1 Parent(s): 9fe72e5
update 1m support

Files changed:
- README.md +120 -1
- config_1m.json +0 -0
- tokenizer_config.json +1 -1
README.md
CHANGED
@@ -33,7 +33,7 @@ We introduce the updated version of the **Qwen3-235B-A22B non-thinking mode**, n
 - Number of Attention Heads (GQA): 64 for Q and 4 for KV
 - Number of Experts: 128
 - Number of Activated Experts: 8
-- Context Length: **262,144 natively**
+- Context Length: **262,144 natively and extendable up to 1,010,000 tokens**

 **NOTE: This model supports only non-thinking mode and does not generate ``<think></think>`` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.**
@@ -188,6 +188,118 @@ for responses in bot.run(messages=messages):
## Processing Ultra-Long Texts

To support **ultra-long context processing** (up to **1 million tokens**), we integrate two key techniques:

- **[Dual Chunk Attention](https://arxiv.org/abs/2402.17463) (DCA)**: A length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
- **[MInference](https://arxiv.org/abs/2407.02490)**: A sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.

Together, these innovations significantly improve both **generation quality** and **inference efficiency** for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a **3× speedup** compared to standard attention implementations.

For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.org/abs/2501.15383).

### How to Enable 1M Token Context

#### Step 1: Update Configuration File

Replace the content of your `config.json` with that of `config_1m.json`, which includes the configuration for length extrapolation and sparse attention.
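For a locally downloaded copy of the model, this swap is just a file copy. Below is a minimal sketch; the directory path and the backup filename are placeholders, not files shipped with the repository.

```python
# Minimal sketch: switch a local copy of the repository to the 1M config.
# The directory path is a placeholder; point it at wherever the
# Qwen/Qwen3-235B-A22B-Instruct-2507 files were downloaded.
import shutil
from pathlib import Path

model_dir = Path("/path/to/Qwen3-235B-A22B-Instruct-2507")  # hypothetical path

# Keep a backup of the original 256K config, then overwrite config.json
# with the 1M variant added in this commit.
shutil.copyfile(model_dir / "config.json", model_dir / "config.json.bak")
shutil.copyfile(model_dir / "config_1m.json", model_dir / "config.json")
```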
After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
#### Option 1: Using vLLM

To run Qwen with 1M context support:

```bash
pip install "vllm>=0.10.0"
```

Then launch the server with Dual Chunk Flash Attention enabled:

```bash
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 1010000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1
```

##### Key Parameters

| Parameter | Purpose |
|-----------|---------|
| `VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN` | Enables the custom attention kernel for long-context efficiency |
| `--max-model-len 1010000` | Sets maximum context length to ~1M tokens |
| `--enable-chunked-prefill` | Allows chunked prefill for very long inputs (avoids OOM) |
| `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
| `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
| `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
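Once the server is up, it exposes vLLM's OpenAI-compatible API. The snippet below is a minimal client sketch, assuming the default local endpoint `http://localhost:8000/v1`; the input file name is a placeholder for your own ultra-long document.

```python
# Minimal client sketch against the OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("long_document.txt", encoding="utf-8") as f:  # hypothetical input file
    long_document = f.read()

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[
        {"role": "user",
         "content": f"{long_document}\n\nSummarize the key points of the document above."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```

Note that with `--max-num-seqs 1`, requests are processed one at a time, so expect long prefill times for inputs near the 1M-token limit.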
##### Troubleshooting

1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."

   The VRAM reserved for the KV cache is insufficient. Consider reducing the `max_model_len` or increasing the `tensor_parallel_size`. Alternatively, you can reduce `max_num_batched_tokens`, although this may significantly slow down inference. For a rough estimate of the KV-cache footprint, see the sketch after this list.

2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."

   The VRAM reserved for activation weights is insufficient. You can try setting `gpu_memory_utilization` to 0.85 or lower, but be aware that this might reduce the VRAM available for the KV cache.

3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."

   The input is too lengthy. Consider using a shorter sequence or increasing the `max_model_len`.
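For error (1), it can help to estimate how much VRAM the KV cache of a single sequence actually needs. The sketch below is a back-of-the-envelope calculation, not part of the official instructions: the layer count and head dimension are placeholders to be read from `config.json`; only the 4 KV heads come from the model card above.

```python
# Rough KV-cache size for ONE sequence, summed over all GPUs.
# num_hidden_layers and head_dim are placeholders: take the real values from
# config.json. num_key_value_heads=4 comes from the model card (GQA: 4 KV heads).

def kv_cache_gib(context_len: int,
                 num_hidden_layers: int,
                 num_key_value_heads: int = 4,
                 head_dim: int = 128,        # assumption; verify in config.json
                 bytes_per_value: int = 2):  # bf16/fp16 cache
    # Both K and V are cached, hence the leading factor of 2.
    per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_value
    return context_len * per_token / 1024 ** 3

# Example: a full 1,010,000-token context with a hypothetical 94-layer model.
print(f"{kv_cache_gib(1_010_000, num_hidden_layers=94):.1f} GiB")
```

Whatever the estimate, it has to fit in the VRAM left over after the model weights; if it does not, lower `max_model_len` or spread the cache across more GPUs via `tensor_parallel_size`.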
#### Option 2: Using SGLang

First, clone and install the specialized branch:

```bash
git clone -b qwen-1m-dca https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

Launch the server with DCA support:

```bash
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --context-length 1010000 \
  --mem-frac 0.75 \
  --attention-backend dual_chunk_flash_attn \
  --tp 8 \
  --chunked-prefill-size 131072
```

##### Key Parameters

| Parameter | Purpose |
|-----------|---------|
| `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
| `--context-length 1010000` | Defines max input length |
| `--mem-frac 0.75` | Allocates 75% GPU memory to KV cache (adjust based on hardware) |
| `--tp 8` | Tensor parallelism size (matches model sharding) |
| `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
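Whichever serving stack you use, it is worth checking that a prompt actually fits inside the 1,010,000-token window before sending it. A small sketch using the model's own tokenizer follows; the input file name is a placeholder.

```python
# Count prompt tokens with the model's tokenizer before sending a request.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")

with open("long_document.txt", encoding="utf-8") as f:  # hypothetical input file
    prompt = f.read()

num_tokens = len(tokenizer(prompt).input_ids)

# Leave headroom for the chat template and the tokens to be generated.
MAX_CONTEXT = 1_010_000
print(f"{num_tokens} prompt tokens; headroom: {MAX_CONTEXT - num_tokens}")
```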
#### Long-Context Performance

We test the model on a 1M version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.

| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---------------------------------------------|---------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
| Qwen3-235B-A22B (Non-Thinking) | 83.9 | 97.7 | 96.1 | 97.5 | 96.1 | 94.2 | 90.3 | 88.5 | 85.0 | 82.1 | 79.2 | 74.4 | 70.0 | 71.0 | 68.5 | 68.0 |
| Qwen3-235B-A22B-Instruct-2507 (Full Attention) | 92.5 | 98.5 | 97.6 | 96.9 | 97.3 | 95.8 | 94.9 | 93.9 | 94.5 | 91.0 | 92.2 | 90.9 | 87.8 | 84.8 | 86.5 | 84.5 |
| Qwen3-235B-A22B-Instruct-2507 (Sparse Attention) | 91.7 | 98.5 | 97.2 | 97.3 | 97.7 | 96.6 | 94.6 | 92.8 | 94.3 | 90.5 | 89.7 | 89.5 | 86.4 | 83.6 | 84.2 | 82.5 |

* All models are evaluated with Dual Chunk Attention enabled.
* Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples each).
## Best Practices

To achieve optimal performance, we recommend the following settings:

@@ -216,4 +328,11 @@ If you find our work helpful, feel free to give us a cite.
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388},
}

@article{qwen2.5-1m,
  title={Qwen2.5-1M Technical Report},
  author={An Yang and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoyan Huang and Jiandong Jiang and Jianhong Tu and Jianwei Zhang and Jingren Zhou and Junyang Lin and Kai Dang and Kexin Yang and Le Yu and Mei Li and Minmin Sun and Qin Zhu and Rui Men and Tao He and Weijia Xu and Wenbiao Yin and Wenyuan Yu and Xiafei Qiu and Xingzhang Ren and Xinlong Yang and Yong Li and Zhiying Xu and Zipeng Zhang},
  journal={arXiv preprint arXiv:2501.15383},
  year={2025}
}
config_1m.json
ADDED
The diff for this file is too large to render. See raw diff.
tokenizer_config.json
CHANGED
@@ -230,7 +230,7 @@
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|im_end|>",
   "errors": "replace",
-  "model_max_length":
+  "model_max_length": 1010000,
   "pad_token": "<|endoftext|>",
   "split_special_tokens": false,
   "tokenizer_class": "Qwen2Tokenizer",
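A quick sanity check (a sketch, with a placeholder local path) that the updated tokenizer configuration is picked up:

```python
# After the change above, the tokenizer should report the extended limit.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/Qwen3-235B-A22B-Instruct-2507")  # local copy
print(tok.model_max_length)  # expected: 1010000
```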