metascroy committed
Commit b41df82 · verified · 1 Parent(s): 692e809

Update README.md

Files changed (1)
  1. README.md +15 -13
README.md CHANGED
@@ -20,8 +20,8 @@ pipeline_tag: text-generation
  [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (INT8-INT4).
  The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

- We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/phi4-mini-INT8-INT4.pte) for direct use in ExecuTorch.
- (The provided pte file is exported with the default max_seq_length/max_context_length of 128; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
+ We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/model.pte) for direct use in ExecuTorch.
+ (The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

  # Running in a mobile app
  The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/phi4-mini-INT8-INT4.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
@@ -213,27 +213,29 @@ We can run the quantized model on a mobile phone using [ExecuTorch](https://gith
  Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.

  We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/pytorch_model.bin) to one that ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
- The following script does this for you. We have uploaded the converted checkpoint [pytorch_model_converted.bin](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/pytorch_model_converted.bin) for convenience.
+ The following script does this for you.
  ```Shell
- python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin
+ HF_MODEL_DIR=$(hf download pytorch/Phi-4-mini-instruct-INT8-INT4)
+ python -m executorch.examples.models.phi_4_mini.convert_weights $HF_MODEL_DIR pytorch_model_converted.bin
  ```

  Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.
- The below command exports with a max_seq_length/max_context_length of 128, the default value, but it can be changed as desired.
+ The below command exports with a max_seq_length/max_context_length of 1024, but it can be changed as desired.

  ```Shell
- PARAMS="executorch/examples/models/phi_4_mini/config.json"
  python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
- --checkpoint "pytorch_model_converted.bin" \
- --params "$PARAMS" \
+ --checkpoint pytorch_model_converted.bin \
+ --params examples/models/phi_4_mini/config/config.json \
+ --output_name model.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
- --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \
- --max_seq_length 128 \
- --max_context_length 128 \
- --output_name="phi4-mini-INT8-INT4.pte"
+ --xnnpack-extended-ops \
+ --max_context_length 1024 \
+ --max_seq_length 1024 \
+ --dtype fp32 \
+ --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'
  ```

  After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
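
For context, the "8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears" scheme named at the top of the README corresponds to torchao's standard quantization API. The sketch below is illustrative only, not the exact recipe used to produce this checkpoint: the 8-bit embedding step is omitted and the `group_size=32` value is an assumption.

```Python
# Illustrative sketch only -- not the exact recipe behind this checkpoint.
# Quantizes linear layers to 8-bit dynamic activations with 4-bit grouped weights.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct", torch_dtype=torch.float32
)
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))  # group_size assumed
```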
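
Likewise, for readers who only want the ready-made pte referenced in the updated link rather than the full repo, a minimal sketch using huggingface_hub; the `model.pte` filename is taken from the README link above and may change if the repo layout changes.

```Python
# Minimal sketch: download just the exported pte from the Hub.
from huggingface_hub import hf_hub_download

pte_path = hf_hub_download(
    repo_id="pytorch/Phi-4-mini-instruct-INT8-INT4",
    filename="model.pte",  # filename from the updated README link
)
print(pte_path)
```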