---
library_name: transformers
license: bsd-3-clause
base_model:
- Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 
tags:
- Qwen
- Qwen2.5-0.5B-Instruct
- Qwen2.5-0.5B-Instruct-GPTQ-Int4 
- GPTQ
- Int4
---

# Qwen2.5-0.5B-Instruct-GPTQ-Int4 

This version of Qwen2.5-0.5B-Instruct-GPTQ-Int4 has been converted to run on the Axera NPU using **w4a16** quantization.

This model has been optimized with the following LoRA: 

Compatible with Pulsar2 version: 4.2 (not released yet)

## Conversion tool links

If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4

[Pulsar2 guide: how to convert an LLM from Hugging Face to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU LLM Runtime](https://github.com/AXERA-TECH/ax-llm) 

## Supported platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
  - *in development*
 
|Chips|w8a16|w4a16|
|--|--|--|
|AX650| 28 tokens/sec|44 tokens/sec|
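
To put the table in per-token terms, the throughput figures can be converted to an average decode latency and a relative speedup. The sketch below uses only the two numbers from the table; the function names are illustrative and not part of any Axera tool.

```python
# Decode throughput from the table above, in tokens per second (AX650).
THROUGHPUT_TOKENS_PER_SEC = {"w8a16": 28.0, "w4a16": 44.0}


def ms_per_token(tokens_per_sec: float) -> float:
    """Convert a throughput in tokens/sec into average milliseconds per token."""
    return 1000.0 / tokens_per_sec


def speedup(fast: float, slow: float) -> float:
    """Ratio of two throughputs (how many times faster `fast` is than `slow`)."""
    return fast / slow


if __name__ == "__main__":
    for name, tps in THROUGHPUT_TOKENS_PER_SEC.items():
        print(f"{name}: {ms_per_token(tps):.1f} ms/token")
    ratio = speedup(
        THROUGHPUT_TOKENS_PER_SEC["w4a16"], THROUGHPUT_TOKENS_PER_SEC["w8a16"]
    )
    print(f"w4a16 vs w8a16: {ratio:.2f}x")
```

By this arithmetic, w4a16 decodes roughly 1.57 times as fast as w8a16 on the AX650, or about 23 ms per token versus about 36 ms.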

## How to use

Download all files from this repository to the device.

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b# tree -L 1
.
├── qwen2.5-0.5b-gptq-int4-ax650
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer.py
├── main_axcl_aarch64
├── main_axcl_x86
├── main_prefill
├── post_config.json
├── run_qwen2.5_0.5b_gptq_int4_ax650.sh
├── run_qwen2.5_0.5b_gptq_int4_axcl_aarch64.sh
└── run_qwen2.5_0.5b_gptq_int4_axcl_x86.sh
```

#### Start the Tokenizer service

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b# python3 qwen2.5_tokenizer.py --port 12345
None None 151645 <|im_end|>
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
hello world<|im_end|>
<|im_start|>assistant

[151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14990, 1879, 151645, 198, 151644, 77091, 198]
http://localhost:12345
```
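
The log above shows the prompt layout the tokenizer service prints before encoding: ChatML-style turns delimited by `<|im_start|>` and `<|im_end|>`, ending with an open assistant turn. A minimal sketch of that string construction in plain Python follows. The function `build_chatml_prompt` is hypothetical and only mirrors the printed text; the actual `qwen2.5_tokenizer.py` may build the prompt differently, for example through the tokenizer's chat template.

```python
# Default system message, copied from the tokenizer log above.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def build_chatml_prompt(user_message: str, system_message: str = DEFAULT_SYSTEM) -> str:
    """Assemble a single-turn ChatML prompt that ends with an open assistant turn.

    Hypothetical helper for illustration; not part of the repository scripts.
    """
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    print(build_chatml_prompt("hello world"))
```

Leaving the assistant turn open is what prompts the model to generate the reply, which it closes with `<|im_end|>` (id 151645 in the log, the same value the runtime reports as `eos_id`).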

#### Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board

Open another terminal and run `run_qwen2.5_0.5b_gptq_int4_ax650.sh`.

```
root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b# ./run_qwen2.5_0.5b_gptq_int4_ax650.sh
[I][                            Init][ 125]: LLM init start
bos_id: -1, eos_id: 151645
  3% | ██                                |   1 /  27 [0.00s<0.08s, 333.33 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  27 /  27 [1.34s<1.34s, 20.10 count/s] init post axmodel ok,remain_cmm(3427 MB)
[I][                            Init][ 241]: max_token_len : 1024
[I][                            Init][ 246]: kv_cache_size : 128, kv_cache_num: 1024
[I][                            Init][ 254]: prefill_token_num : 128
[I][                     load_config][ 281]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 268]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you
[I][                             Run][ 466]: ttft: 134.66 ms
I am Qwen, a Qwen AI created by Alibaba Cloud. I am here to assist you with various topics and provide help to the best of my ability. I am here to help  
with any questions you have about science, technology, or any other topic you might have for help or guidance. I am always happy to help you!
[N][                             Run][ 605]: hit eos,avg 42.11 token/s

>> 1+1=?
[I][                             Run][ 466]: ttft: 135.07 ms
1+1=2
[N][                             Run][ 605]: hit eos,avg 43.04 token/s
```
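
The `load config` output above enables temperature scaling (`temperature` 0.9) and top-k sampling (`top_k` 10), with top-p sampling and the repetition penalty disabled. As an illustration of what those two settings generally mean, here is a generic sketch of temperature plus top-k sampling over a list of logits. It is not taken from the ax-llm runtime, whose implementation may differ; `sample_top_k` and its parameters are names chosen for this example.

```python
import math
import random
from typing import Optional, Sequence


def sample_top_k(
    logits: Sequence[float],
    temperature: float = 0.9,
    top_k: int = 10,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index from the top_k highest logits after temperature scaling.

    Generic illustration only; not the ax-llm implementation.
    """
    if not logits:
        raise ValueError("logits must not be empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    rng = rng or random.Random()

    # Keep only the indices of the top_k largest logits.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]

    # Scale by temperature, then apply a numerically stable softmax numerator.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.5, -1.0, 3.0]
    print(sample_top_k(example_logits, top_k=2, rng=random.Random(0)))
```

A temperature below 1 sharpens the distribution toward the highest logits, and `top_k` caps how many candidates can be drawn at all, so with `top_k=1` the function always returns the index of the largest logit.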

#### Inference with M.2 Accelerator card

[What is the M.2 Accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.

```
(base) axera@raspberrypi:~/samples/qwen2.5-0.5b $ ./run_qwen2.5_0.5b_gptq_int4_axcl_aarch64.sh
build time: Feb 13 2025 15:44:57
[I][                            Init][ 111]: LLM init start
bos_id: -1, eos_id: 151645
100% | ████████████████████████████████ |  27 /  27 [11.64s<11.64s, 2.32 count/s] init post axmodel okmain_cmm(6788 MB)
[I][                            Init][ 226]: max_token_len : 1024
[I][                            Init][ 231]: kv_cache_size : 128, kv_cache_num: 1024
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 288]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you
I am Qwen, a Qwen-like language model created by Alibaba Cloud. I am designed to assist users in answering questions, generating text,
and participating in conversations. I am here to help you with your questions and to engage in meaningful exchanges with you.
If you have any questions, you can ask me, and if you want, you can even write to me!
[N][                             Run][ 610]: hit eos,avg 25.88 token/s

>> 1+1=?
1+1=2
[N][                             Run][ 610]: hit eos,avg 29.73 token/s
>> q

(base) axera@raspberrypi:~/samples/qwen2.5-0.5b $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI  V2.26.0_20250205130139                                Driver  V2.26.0_20250205130139 |
+-----------------------------------------+--------------+---------------------------------------+
| Card  Name                     Firmware | Bus-Id       |                          Memory-Usage |
| Fan   Temp                Pwr:Usage/Cap | CPU      NPU |                             CMM-Usage |
|=========================================+==============+=======================================|
|    0  AX650N                    V2.26.0 | 0000:01:00.0 |                170 MiB /      945 MiB |
|   --   43C                      -- / -- | 2%        0% |                392 MiB /     7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+

+------------------------------------------------------------------------------------------------+
| Processes:                                                                                     |
| Card      PID  Process Name                                                   NPU Memory Usage |
|================================================================================================|
|    0   474440  /home/axera/samples/qwen2.5-0.5b-gptq-int4/main_axcl_aarch64         370172 KiB |
+------------------------------------------------------------------------------------------------+
(base) axera@raspberrypi:~ $
```