jerryzh168 committed
Commit eb836ac · verified · 1 Parent(s): 4e0a20d

Update README.md

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -18,14 +18,14 @@ pipeline_tag: text-generation
 ---
 
 
-[Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (8da4w).
+[Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (INT8-INT4).
 The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
 
-We provide the [quantized pte](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/qwen3-4B-8da4w-1024-cxt.pte) for direct use in ExecuTorch.
+We provide the [quantized pte](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/qwen3-4B-8da4w-1024-cxt.pte) for direct use in ExecuTorch.
 (The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
 
 # Running in a mobile app
-The [pte file](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/qwen3-4B-8da4w-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+The [pte file](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/qwen3-4B-8da4w-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
 On iPhone 15 Pro, the model runs at 14.8 tokens/sec and uses 3379 MB of memory.
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66049fc71116cebd1d3bdcf4/eVHB7fVllmwVauKJvGu0d.png)
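
As context for the 8da4w recipe referenced in this hunk: it quantizes linear weights to int4 with int8 dynamic activations. A minimal sketch using torchao's `quantize_` API follows; the group size and the handling of the 8-bit embedding quantization are assumptions, not necessarily the exact recipe used to produce this checkpoint, and config names vary across torchao versions.

```Python
# Sketch only: 8-bit dynamic activation + 4-bit weight (8da4w) quantization
# of the linear layers with torchao. The group size (32) is an assumption;
# the 8-bit embedding quantization in the actual recipe is handled separately.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype=torch.bfloat16)
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```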
@@ -130,7 +130,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 # Push to hub
 MODEL_NAME = model_id.split("/")[-1]
-save_to = f"{USER_ID}/{MODEL_NAME}-8da4w"
+save_to = f"{USER_ID}/{MODEL_NAME}-INT8-INT4"
 quantized_model.push_to_hub(save_to, safe_serialization=False)
 tokenizer.push_to_hub(save_to)
 
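Since the push above uses safe_serialization=False, the checkpoint lands on the Hub as pytorch_model.bin. As a quick smoke test, it can be reloaded like any transformers model; the repo id below is an assumption, substitute your own `save_to` value.

```Python
# Sketch only: reload the pushed quantized checkpoint and generate a few tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "pytorch/Qwen3-4B-INT8-INT4"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```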
@@ -171,7 +171,7 @@ Hello! I'm Qwen, a large language model developed by Alibaba Cloud. While I don'
 
 | Benchmark | | |
 |----------------------------------|----------------|---------------------------|
-| | Qwen3-4B | Qwen3-4B-8da4w |
+| | Qwen3-4B | Qwen3-4B-INT8-INT4 |
 | **Popular aggregated benchmark** | | |
 | mmlu | 68.38 | 66.74 |
 | mmlu_pro | 49.71 | 46.73 |
@@ -198,9 +198,9 @@ Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation
 lm_eval --model hf --model_args pretrained=Qwen/Qwen3-4B --tasks mmlu --device cuda:0 --batch_size auto
 ```
 
-## int8 dynamic activation and int4 weight quantization (8da4w)
+## int8 dynamic activation and int4 weight quantization (INT8-INT4)
 ```Shell
-lm_eval --model hf --model_args pretrained=pytorch/Qwen3-4B-8da4w --tasks mmlu --device cuda:0 --batch_size auto
+lm_eval --model hf --model_args pretrained=pytorch/Qwen3-4B-INT8-INT4 --tasks mmlu --device cuda:0 --batch_size auto
 ```
 </details>
 
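The lm_eval CLI calls in this hunk can also be driven from Python via lm-evaluation-harness's `simple_evaluate` entry point; the sketch below assumes the renamed repo id and that the aggregated mmlu score appears under the "results" key.

```Python
# Sketch only: Python equivalent of the lm_eval CLI invocation above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Qwen3-4B-INT8-INT4",  # assumed repo id
    tasks=["mmlu"],
    batch_size="auto",
)
print(results["results"]["mmlu"])  # aggregated mmlu metrics (location assumed)
```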
@@ -209,10 +209,10 @@ lm_eval --model hf --model_args pretrained=pytorch/Qwen3-4B-8da4w --tasks mmlu -
 We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
 Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
 
-We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/pytorch_model.bin) to one that ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
-The following script does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Qwen3-4B-8da4w/blob/main/pytorch_model_converted.bin) for convenience.
+We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/pytorch_model.bin) to one that ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
+The following script does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/pytorch_model_converted.bin) for convenience.
 ```Shell
-python -m executorch.examples.models.qwen3.convert_weights $(huggingface-cli download pytorch/Qwen3-4B-8da4w) pytorch_model_converted.bin
+python -m executorch.examples.models.qwen3.convert_weights $(huggingface-cli download pytorch/Qwen3-4B-INT8-INT4) pytorch_model_converted.bin
 ```
 
 Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.
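
Conceptually, the convert_weights step in this hunk is a key-rename pass over the state dict. The sketch below is illustrative only; the real old-to-new key mapping lives in executorch.examples.models.qwen3.convert_weights, and the rename shown here is hypothetical.

```Python
# Sketch only: the shape of a checkpoint key-rename pass. The actual mapping
# is defined by ExecuTorch's convert_weights script; this rename is made up.
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
converted = {}
for key, tensor in state_dict.items():
    converted[key.replace("model.layers.", "layers.")] = tensor  # hypothetical rename
torch.save(converted, "pytorch_model_converted.bin")
```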
 