Commit 1c8249c · Parent: d199b9e
feihu.hf committed: update README

Files changed (1): README.md (+29 −19)
README.md CHANGED
@@ -201,10 +201,15 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.

### How to Enable 1M Token Context

+ > [!NOTE]
+ > To effectively process a 1 million token context, users will require approximately **1000 GB** of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
+
#### Step 1: Update Configuration File

Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.

+ #### Step 2: Start Model Server
+
After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.

#### Option 1: Using vLLM
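For Step 1 above, a minimal sketch of the config swap on a locally downloaded checkpoint; the directory path below is illustrative (not from the README), and backing up the original `config.json` is optional:

```bash
# Illustrative local path to the downloaded checkpoint.
MODEL_DIR=./Qwen3-235B-A22B-Instruct-2507

cp "$MODEL_DIR/config.json" "$MODEL_DIR/config.json.bak"   # keep the original for rollback
cp "$MODEL_DIR/config_1m.json" "$MODEL_DIR/config.json"    # enable length extrapolation + sparse attention
```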
@@ -212,7 +217,9 @@ After updating the config, proceed with either **vLLM** or **SGLang** for servin
To run Qwen with 1M context support:

```bash
- pip install vllm>=0.10.0
+ git clone https://github.com/vllm-project/vllm.git
+ cd vllm
+ pip install -e .
```

Then launch the server with Dual Chunk Flash Attention enabled:
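Before launching, an optional sanity check that Python picks up the editable source build above rather than a previously installed wheel; nothing here is specific to this model:

```bash
# Confirm the editable install is importable and print its version string.
python -c "import vllm; print(vllm.__version__)"
```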
@@ -225,7 +232,8 @@ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
  --enforce-eager \
- --max-num-seqs 1
+ --max-num-seqs 1 \
+ --gpu-memory-utilization 0.85
```

##### Key Parameters
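Once the `vllm serve` command above is running, requests go through vLLM's OpenAI-compatible HTTP API. A minimal smoke test, assuming the default host and port (`localhost:8000`) and the model name as served:

```bash
# Send a small chat request to the OpenAI-compatible endpoint exposed by `vllm serve`.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "max_tokens": 128
      }'
```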
@@ -238,28 +246,14 @@ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
| `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
| `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
| `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
+ | `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used for the model executor |
-
- ##### Troubleshooting:
-
- 1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
-
- The VRAM reserved for the KV cache is insufficient. Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
-
- 2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
-
- The VRAM reserved for activation weights is insufficient. You can try setting ``gpu_memory_utilization`` to 0.85 or lower, but be aware that this might reduce the VRAM available for the KV cache.
-
- 3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
-
- The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len``.
-

#### Option 2: Using SGLang

First, clone and install the specialized branch:

```bash
- git clone -b qwen-1m-dca https://github.com/sgl-project/sglang.git
+ git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
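After installing SGLang as above and launching `sglang.launch_server` with the flags summarized in the next hunk, the server also exposes an OpenAI-compatible API. A minimal check, assuming SGLang's default port `30000` (adjust if `--port` is set; the model name may need to match what the server reports under `/v1/models`):

```bash
# Same style of request against the SGLang server's OpenAI-compatible endpoint.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
        "messages": [{"role": "user", "content": "Hello! How long a context can you handle?"}],
        "max_tokens": 128
      }'
```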
@@ -282,10 +276,26 @@ python3 -m sglang.launch_server \
|---------|--------|
| `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
| `--context-length 1010000` | Defines max input length |
- | `--mem-frac 0.75` | Allocates 75% GPU memory to KV cache (adjust based on hardware) |
+ | `--mem-frac 0.75` | Fraction of GPU memory reserved for static allocation (model weights and the KV-cache memory pool); use a smaller value if you see out-of-memory errors |
| `--tp 8` | Tensor parallelism size (matches model sharding) |
| `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |

+ #### Troubleshooting
+
+ 1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
+
+    The VRAM reserved for the KV cache is insufficient.
+    - vLLM: consider reducing `max_model_len` or increasing `tensor_parallel_size`. Alternatively, you can reduce `max_num_batched_tokens`, although this may significantly slow down inference.
+    - SGLang: consider reducing `context-length` or increasing `tp`. Alternatively, you can reduce `chunked-prefill-size`, although this may significantly slow down inference.
+
+ 2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
+
+    The VRAM reserved for activations is insufficient. Try lowering `gpu-memory-utilization` (vLLM) or `mem-frac` (SGLang), but be aware that this also reduces the VRAM available for the KV cache.
+
+ 3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
+
+    The input is too long. Use a shorter sequence or increase `max_model_len` (vLLM) / `context-length` (SGLang).
+

#### Long-Context Performance

We test the model on a 1M version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.
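Most of the troubleshooting entries added above come down to VRAM headroom. One way to watch per-GPU memory while the server loads weights or prefills a long prompt (assumes NVIDIA GPUs with the standard driver tools installed):

```bash
# Print per-GPU memory usage; rerun it (or wrap in `watch -n 5`) during loading or prefill.
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```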
 