feihu.hf committed
Commit 1c8249c · 1 Parent(s): d199b9e
update README

README.md CHANGED

@@ -201,10 +201,15 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 ### How to Enable 1M Token Context
 
+> [!NOTE]
+> To effectively process a 1 million token context, users will require approximately **1000 GB** of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
+
 #### Step 1: Update Configuration File
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
+#### Step 2: Start Model Server
+
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
 
 #### Option 1: Using vLLM
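For Step 1 above, the swap is just a file copy inside the downloaded model directory. A minimal sketch, assuming the checkpoint was fetched to a local folder named `Qwen3-235B-A22B-Instruct-2507`; the path and the backup filename are illustrative, not part of the commit:

```bash
# Illustrative only: paths assume a local snapshot of the model repository.
cd Qwen3-235B-A22B-Instruct-2507

# Keep the original configuration around, then promote the 1M-token config.
cp config.json config.json.bak
cp config_1m.json config.json
```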
@@ -212,7 +217,9 @@ After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
 To run Qwen with 1M context support, install vLLM from source:
 
 ```bash
-
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+pip install -e .
 ```
 
 Then launch the server with Dual Chunk Flash Attention enabled:
@@ -225,7 +232,8 @@ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
   --enable-chunked-prefill \
   --max-num-batched-tokens 131072 \
   --enforce-eager \
-  --max-num-seqs 1
+  --max-num-seqs 1 \
+  --gpu-memory-utilization 0.85
 ```
 
 ##### Key Parameters
@@ -238,28 +246,14 @@ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
 | `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
 | `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
 | `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
-
-##### Troubleshooting:
-
-1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
-
-   The VRAM reserved for the KV cache is insufficient. Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
-
-2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
-
-   The VRAM reserved for activation weights is insufficient. You can try setting ``gpu_memory_utilization`` to 0.85 or lower, but be aware that this might reduce the VRAM available for the KV cache.
-
-3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
-
-   The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len``.
-
+| `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used for the model executor |
 
 #### Option 2: Using SGLang
 
 First, clone and install the specialized branch:
 
 ```bash
-git clone
+git clone https://github.com/sgl-project/sglang.git
 cd sglang
 pip install -e "python[all]"
 ```
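After cloning and installing either framework, a quick check that pip resolves the source installs can save a failed launch. This check is an editorial suggestion, not part of the README:

```bash
# Verify the editable installs are the ones on the path
# (Location should point at the cloned checkouts).
python3 -m pip show vllm sglang
```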
@@ -282,10 +276,26 @@ python3 -m sglang.launch_server \
 |---------|--------|
 | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
 | `--context-length 1010000` | Defines max input length |
-| `--mem-frac 0.75` |
+| `--mem-frac 0.75` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
 | `--tp 8` | Tensor parallelism size (matches model sharding) |
 | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
 
+#### Troubleshooting:
+
+1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
+
+   The VRAM reserved for the KV cache is insufficient.
+   - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
+   - SGLang: Consider reducing the ``context-length`` or increasing the ``tp``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
+
+2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
+
+   The VRAM reserved for activation weights is insufficient. You can try lowering ``gpu_memory_utilization`` or ``mem-frac``, but be aware that this might reduce the VRAM available for the KV cache.
+
+3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
+
+   The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len`` or ``context-length``.
+
 #### Long-Context Performance
 
 We test the model on a 1M-token version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.
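Once either server is up, the model is reachable through an OpenAI-compatible HTTP API. A minimal smoke test, assuming vLLM's default port 8000 on localhost (SGLang defaults to port 30000); the ports, prompt, and token limit are assumptions, not part of the commit:

```bash
# Minimal request against the OpenAI-compatible endpoint exposed by `vllm serve`.
# Use port 30000 instead if the model was launched with sglang.launch_server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
        "messages": [
          {"role": "user", "content": "Summarize the key points of the attached report."}
        ],
        "max_tokens": 512
      }'
```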
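If the first or second troubleshooting error appears, the memory-related flags above are the knobs to turn. A rough sketch of a more conservative vLLM relaunch, with illustrative values only; the unchanged flags from the full command in the README, which this diff view elides, still apply:

```bash
# Illustrative values, not from the commit: trade context length and prefill batch
# size for headroom when the KV cache or activations do not fit.
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 600000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 65536 \
  --enforce-eager \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.8
```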