---
library_name: transformers
license: bsd-3-clause
base_model:
- Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
tags:
- Qwen
- Qwen2.5-7B-Instruct
- Qwen2.5-7B-Instruct-GPTQ-Int4
- GPTQ
- Int4
---
# Qwen2.5-7B-Instruct-GPTQ-Int4
This version of Qwen2.5-7B-Instruct-GPTQ-Int4 has been converted to run on the Axera NPU using **w4a16** quantization (4-bit weights, 16-bit activations).
Compatible with Pulsar2 version 3.4 (not released yet).
## Conversion tool links
If you are interested in model conversion, you can export an axmodel yourself starting from the original repo: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
[AXera NPU LLM Runtime](https://github.com/AXERA-TECH/ax-llm)
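The conversion itself is driven by Pulsar2's `llm_build` command, documented at the Pulsar2 link above. Below is a minimal sketch of invoking it from Python; the flag names follow the Pulsar2 LLM build documentation, but the values are illustrative assumptions and should be checked against your Pulsar2 release, since version 3.4 is not yet published:
```
# Sketch of driving Pulsar2's llm_build from Python. Flag names follow the
# Pulsar2 LLM build docs linked above; values are examples only -- verify
# them against your Pulsar2 release (3.4 is not yet published).
import subprocess

subprocess.run(
    [
        "pulsar2", "llm_build",
        "--input_path", "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
        "--output_path", "qwen2.5-7b-gptq-int4-ax650",
        "--kv_cache_len", "1023",
        "--prefill_len", "128",
        "--hidden_state_type", "bf16",
        "--chip", "AX650",
    ],
    check=True,
)
```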
## Supported Platforms
- AX650
- AX650N DEMO Board
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
Decode throughput of this model on supported chips:

|Chip|w8a16|w4a16|
|--|--|--|
|AX650|2.6 tokens/sec|4.8 tokens/sec|
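A quick way to read the table: end-to-end latency is roughly the time to first token (TTFT, about 1.14 s in the run logs below) plus the number of generated tokens divided by the decode rate. A small arithmetic sketch:
```
# Rough latency estimate: TTFT + decode time, using figures from this card.
# These numbers were measured on AX650 with w4a16; yours will vary.
TTFT_S = 1.14        # time to first token, seconds (from the run log below)
DECODE_TOK_S = 4.8   # decode throughput, tokens/sec (from the table above)

def estimated_latency_s(new_tokens: int) -> float:
    """Approximate wall-clock time to generate `new_tokens` tokens."""
    return TTFT_S + new_tokens / DECODE_TOK_S

if __name__ == "__main__":
    for n in (16, 64, 256):
        print(f"{n:4d} tokens -> ~{estimated_latency_s(n):.1f} s")
```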
## How to use
Download all files from this repository to the device:
```
root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# tree -L 1
.
├── qwen2.5-7b-gptq-int4-ax650
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer.py
├── main_axcl_aarch64
├── main_axcl_x86
├── main_prefill
├── post_config.json
├── run_qwen2.5_7b_gptq_int4_ax650.sh
├── run_qwen2.5_7b_gptq_int4_axcl_aarch64.sh
└── run_qwen2.5_7b_gptq_int4_axcl_x86.sh
```
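If you would rather script the download than use the web UI, the `huggingface_hub` client can mirror the whole repository. A minimal sketch; the `repo_id` below is an assumed placeholder and should be replaced with this repository's actual id:
```
# Download every file in the model repo to a local directory.
# REPO_ID is an assumed placeholder -- substitute this repository's real id.
from huggingface_hub import snapshot_download

REPO_ID = "AXERA-TECH/Qwen2.5-7B-Instruct-GPTQ-Int4"  # assumption, verify before use
snapshot_download(repo_id=REPO_ID, local_dir="qwen2.5-7b")
```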
#### Start the Tokenizer service
```
root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# python qwen2.5_tokenizer.py --port 12345
None None 151645 <|im_end|>
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
hello world<|im_end|>
<|im_start|>assistant
[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14990,
http://localhost:12345
```
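The on-device runtime keeps the tokenizer out of the native binary and queries this Python service over HTTP for encode/decode. The bundled `qwen2.5_tokenizer.py` already implements this, so the sketch below only illustrates the idea; the endpoint paths and JSON fields are invented for illustration, not the actual protocol expected by `ax-llm`:
```
# Minimal sketch of a tokenizer-over-HTTP service, in the spirit of
# qwen2.5_tokenizer.py. Endpoint paths and JSON fields are illustrative;
# the real script defines its own protocol for ax-llm.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qwen2.5_tokenizer")  # local dir from the listing above

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if self.path == "/encode":
            out = {"ids": tokenizer.encode(body["text"])}
        else:  # assume /decode
            out = {"text": tokenizer.decode(body["ids"])}
        data = json.dumps(out).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 12345), Handler).serve_forever()
```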
#### Inference on an AX650 host, such as the M4N-Dock(爱芯派Pro) or the AX650N DEMO Board
Open another terminal and run `run_qwen2.5_7b_gptq_int4_ax650.sh`
```
root@ax650:/mnt/qtang/llm-test/qwen2.5-7b# ./run_qwen2.5_7b_gptq_int4_ax650.sh
[I][ Init][ 125]: LLM init start
bos_id: -1, eos_id: 151645
3% | ██ | 1 / 31 [0.00s<0.09s, 333.33 count/s] tokenizer init ok
100% | ████████████████████████████████ | 31 / 31 [45.25s<45.25s, 0.69 count/s] init post axmodel ok, remain_cmm(7664 MB)
[I][ Init][ 246]: kv_cache_size : 512, kv_cache_num: 1024
[I][ Init][ 254]: prefill_token_num : 128
[I][ load_config][ 281]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 268]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> 1+1=?
[I][ Run][ 466]: ttft: 1138.88 ms
1+1 equals 2.
[N][ Run][ 605]: hit eos,avg 4.65 token/s
>> who are you
[I][ Run][ 466]: ttft: 1137.90 ms
I'm Qwen, a large language model created by Alibaba Cloud. How can I assist you today?
[N][ Run][ 605]: hit eos,avg 4.52 token/s
```
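The `post_config.json` echoed in the log controls sampling: logits are divided by a temperature of 0.9 and the next token is drawn from the 10 most likely candidates, with top-p sampling and the repetition penalty disabled. A minimal NumPy sketch of that scheme, not the runtime's actual implementation:
```
# Top-k + temperature sampling as configured in post_config.json
# (temperature=0.9, top_k=10; top_p and repetition penalty disabled).
# Illustrative only -- not the ax-llm runtime's actual code.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.9, top_k: int = 10) -> int:
    scaled = logits / temperature
    # Keep only the top_k highest-scoring candidates.
    top_idx = np.argpartition(scaled, -top_k)[-top_k:]
    top_logits = scaled[top_idx]
    # Numerically stable softmax over the survivors.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(np.random.choice(top_idx, p=probs))

if __name__ == "__main__":
    fake_logits = np.random.randn(151936)  # Qwen2.5 vocabulary size
    print(sample_next_token(fake_logits))
```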
#### Inference with M.2 Accelerator card
[What is the M.2 Accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.
```
(base) axera@raspberrypi:~/samples/qwen2.5-7b $ ./run_qwen2.5_7b_gptq_int4_axcl_aarch64.sh
build time: Feb 13 2025 15:15:07
[I][ Init][ 111]: LLM init start
bos_id: -1, eos_id: 151645
100% | ████████████████████████████████ | 31 / 31 [67.43s<67.43s, 0.46 count/s] init post axmodel ok, remain_cmm(2739 MB)
[I][ Init][ 226]: max_token_len : 1024
[I][ Init][ 231]: kv_cache_size : 512, kv_cache_num: 1024
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 288]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you
I am Qwen, a large language model created by Alibaba Cloud. I'm here to help you with any questions or tasks you might have!
[N][ Run][ 610]: hit eos,avg 4.33 token/s
>> 1+1=?
1+1 equals 2.
[N][ Run][ 610]: hit eos,avg 4.54 token/s
>> q
(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI V2.26.0_20250206225448 Driver V2.26.0_20250206225448 |
+-----------------------------------------+--------------+---------------------------------------+
| Card Name Firmware | Bus-Id | Memory-Usage |
| Fan Temp Pwr:Usage/Cap | CPU NPU | CMM-Usage |
|=========================================+==============+=======================================|
+-----------------------------------------+--------------+---------------------------------------+
| 0 AX650N V2.26.0 | 0000:05:00.0 | 175 MiB / 945 MiB |
| -- 61C -- / -- | 0% 0% | 4301 MiB / 7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+
+------------------------------------------------------------------------------------------------+
| Processes: |
| Card PID Process Name NPU Memory Usage |
|================================================================================================|
| 0 63118 /home/axera/samples/qwen2.5-7b-gptq-int4/main_axcl_aarch64 4316448 KiB |
+------------------------------------------------------------------------------------------------+
```