Update README.md

[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (INT8-INT4).
The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/model.pte) for direct use in ExecuTorch.
(The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
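If you only need the ready-made file, you can fetch it directly. Below is a minimal sketch using the `hf` CLI from `huggingface_hub` (the same tool used in the export instructions later on); the positional filename argument is an assumption worth verifying against your CLI version:

```Shell
# Download just the pre-exported pte file from the model repo
hf download pytorch/Phi-4-mini-instruct-INT8-INT4 model.pte
```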

# Running in a mobile app
The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/phi4-mini-INT8-INT4.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
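Before wiring the pte into an app, it can help to smoke-test it on a desktop machine. The sketch below is an assumption, not part of the model card: it presumes you have built ExecuTorch's example `llama_main` runner from source and that your build can load the model's `tokenizer.json` (downloaded from the model repo alongside the pte):

```Shell
# A minimal sketch: run the pte on the host with ExecuTorch's example runner.
# The binary location and tokenizer support depend on your ExecuTorch build.
cmake-out/examples/models/llama/llama_main \
  --model_path=model.pte \
  --tokenizer_path=tokenizer.json \
  --prompt="Tell me a short story."
```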

# Exporting to ExecuTorch

We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.

We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/pytorch_model.bin) to the format ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
The following script does this for you.
```Shell
# Download the quantized checkpoint from the Hugging Face Hub
HF_MODEL_DIR=$(hf download pytorch/Phi-4-mini-instruct-INT8-INT4)
# Rename the checkpoint keys into the layout export_llama expects
python -m executorch.examples.models.phi_4_mini.convert_weights $HF_MODEL_DIR pytorch_model_converted.bin
```

Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.
The command below exports with a max_seq_length/max_context_length of 1024, but these values can be changed as desired.

```Shell
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint pytorch_model_converted.bin \
  --params examples/models/phi_4_mini/config/config.json \
  --output_name model.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --xnnpack-extended-ops \
  --max_context_length 1024 \
  --max_seq_length 1024 \
  --dtype fp32 \
  --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'
```
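A note on the flags, based on ExecuTorch's `export_llama` options (worth double-checking against the version you have installed): `-kv` enables the KV cache, `-X` delegates supported operators to XNNPACK, and `--xnnpack-extended-ops` extends that coverage to the quantized linear operators. The `--metadata` JSON records Phi-4-mini's BOS/EOS token ids in the pte so the on-device runner knows where generation starts and stops.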

After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).