Update README.md

README.md
@@ -18,14 +18,14 @@ pipeline_tag: text-generation

---

[Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (INT8-INT4).
The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the [quantized pte](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/qwen3-4B-8da4w-1024-cxt.pte) for direct use in ExecuTorch.
(The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
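
The quantized checkpoint can also be run directly in Python through transformers' torchao integration. A minimal sketch, assuming torchao and accelerate are installed; the prompt and generation settings below are illustrative, not the card's official recommendation:

```Python
# Minimal sketch: load the torchao-quantized checkpoint with transformers.
# Assumes `pip install torch torchao transformers accelerate`; the prompt
# and generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Qwen3-4B-INT8-INT4"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "What is ExecuTorch?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```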

# Running in a mobile app
The [pte file](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/qwen3-4B-8da4w-1024-cxt.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this on iOS.
On iPhone 15 Pro, the model runs at 14.8 tokens/sec and uses 3379 MB of memory.

![](https://cdn-uploads.huggingface.co/production/uploads/66049fc71116cebd1d3bdcf4/rneEJ2MjU7Hlvrbw8bPrd.png)

@@ -130,7 +130,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-INT8-INT4"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
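
For context, the elided top of this code block (the card's quantization recipe) builds `quantized_model` roughly along these lines. This is a sketch, not the team's exact recipe: `group_size=32` is an assumption, and the 8-bit embedding quantization mentioned in the intro is omitted for brevity.

```Python
# Sketch of the elided step that produces `quantized_model` above.
# Int8DynamicActivationInt4WeightConfig gives 8-bit dynamic activations
# with 4-bit weight linears; group_size=32 is an assumed value, and the
# 8-bit embedding config from the intro is omitted here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt4WeightConfig, quantize_

USER_ID = "your-hf-username"  # hypothetical placeholder
model_id = "Qwen/Qwen3-4B"

quantized_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
quantize_(quantized_model, Int8DynamicActivationInt4WeightConfig(group_size=32))  # in-place
tokenizer = AutoTokenizer.from_pretrained(model_id)
```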

@@ -171,7 +171,7 @@ Hello! I'm Qwen, a large language model developed by Alibaba Cloud. While I don'

| Benchmark                        | Qwen3-4B | Qwen3-4B-INT8-INT4 |
|----------------------------------|----------|--------------------|
| **Popular aggregated benchmark** |          |                    |
| mmlu                             | 68.38    | 66.74              |
| mmlu_pro                         | 49.71    | 46.73              |

@@ -198,9 +198,9 @@ Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation

```Shell
lm_eval --model hf --model_args pretrained=Qwen/Qwen3-4B --tasks mmlu --device cuda:0 --batch_size auto
```

## int8 dynamic activation and int4 weight quantization (INT8-INT4)
```Shell
lm_eval --model hf --model_args pretrained=pytorch/Qwen3-4B-INT8-INT4 --tasks mmlu --device cuda:0 --batch_size auto
```
</details>

@@ -209,10 +209,10 @@ lm_eval --model hf --model_args pretrained=pytorch/Qwen3-4B-8da4w --tasks mmlu -

We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.

We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/pytorch_model.bin) to the format ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
The following command does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4/blob/main/pytorch_model_converted.bin) for convenience.
```Shell
python -m executorch.examples.models.qwen3.convert_weights $(huggingface-cli download pytorch/Qwen3-4B-INT8-INT4) pytorch_model_converted.bin
```

Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.
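
The export command itself sits just below this hunk and is unchanged by the diff. For orientation only, a hypothetical invocation of ExecuTorch's `export_llama` entry point; the `--model` value and exact flag set are assumptions (additional arguments such as a params config may be required), so check the script's `--help` before running:

```Shell
# Hypothetical sketch of the elided export step; the --model value and
# flag set are assumptions. Verify with:
#   python -m executorch.examples.models.llama.export_llama --help
python -m executorch.examples.models.llama.export_llama \
  --model qwen3_4b \
  --checkpoint pytorch_model_converted.bin \
  -kv -X \
  --max_seq_length 1024 \
  --output_name qwen3-4B-8da4w-1024-cxt.pte
```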