[Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao), with 8-bit embeddings and 8-bit dynamic activation, 4-bit weight (8da4w) quantization on the linear layers (INT8-INT4).
The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
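
For orientation, here is a minimal sketch of what an 8da4w quantization pass can look like in torchao. It is an illustration, not the exact recipe used to produce this checkpoint; it assumes a recent torchao release with the config-style `quantize_` API, and `group_size=32` is an illustrative choice.

```python
# Minimal sketch of an 8da4w (8-bit dynamic activation, 4-bit weight)
# quantization pass with torchao. Illustrative only: assumes a recent
# torchao with the config-style API, not the exact recipe for this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt4WeightConfig, quantize_

model_id = "Qwen/Qwen3-4B"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the linear layers in place: int8 dynamic activations,
# int4 grouped weights. group_size=32 is an illustrative choice.
quantize_(quantized_model, Int8DynamicActivationInt4WeightConfig(group_size=32))
```

The 8-bit embedding quantization mentioned above is configured separately and is not shown in this sketch.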

We provide the [quantized pte file](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/qwen3-4B-8da4w-1024-cxt.pte) for direct use in ExecuTorch.
(The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a mobile app
The [pte file](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/qwen3-4B-8da4w-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this on iOS.
On iPhone 15 Pro, the model runs at 14.8 tokens/sec and uses 3379 MB of memory.



The quantization recipe pushes the quantized model and tokenizer to the Hub (`USER_ID` is your Hugging Face username):

```python
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-INT8-INT4"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```
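
Once pushed, the quantized checkpoint loads back like any other Hub model (`safe_serialization=False` is used above because torchao's quantized tensor subclasses are not supported by safetensors, so the weights ship as `pytorch_model.bin`). A minimal sketch with a placeholder repo id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute the repo created by push_to_hub above.
repo_id = "your-user/Qwen3-4B-INT8-INT4"

# torchao must be installed at load time so the quantized weights deserialize.
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```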

| Benchmark                        | Qwen3-4B | Qwen3-4B-INT8-INT4 |
|----------------------------------|----------|--------------------|
| **Popular aggregated benchmark** |          |                    |
| mmlu                             | 68.38    | 66.74              |
| mmlu_pro                         | 49.71    | 46.73              |

We need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness

## baseline
```Shell
lm_eval --model hf --model_args pretrained=Qwen/Qwen3-4B --tasks mmlu --device cuda:0 --batch_size auto
```

## int8 dynamic activation and int4 weight quantization (INT8-INT4)
```Shell
lm_eval --model hf --model_args pretrained=pytorch/Qwen3-4B-INT8-INT4 --tasks mmlu --device cuda:0 --batch_size auto
```
</details>

# Exporting to ExecuTorch
We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.

We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/pytorch_model.bin) to the format ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
The following script does this for you; we have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/pytorch_model_converted.bin) for convenience.
```Shell
python -m executorch.examples.models.qwen3.convert_weights $(huggingface-cli download pytorch/Qwen3-4B-INT8-INT4) pytorch_model_converted.bin
```

Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.