metascroy committed
Commit 3d69139 · verified · 1 parent: efabc4c

Update README.md

Files changed (1): README.md (+9 -9)
README.md CHANGED
@@ -24,7 +24,7 @@ We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruc
  (The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

  # Running in a mobile app
- The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/phi4-mini-INT8-INT4.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+ The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/model.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
  On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 MB of memory.

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66049fc71116cebd1d3bdcf4/521rXwIlYS9HIAEBAPJjw.png)
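For reference, a minimal sketch of fetching the exported pte locally (for example before bundling it into the iOS demo app), assuming the `huggingface_hub` Python API and the `model.pte` filename from the link above:

```python
# Minimal sketch: download the exported .pte from the Hub so it can be bundled
# into the iOS demo app. Assumes huggingface_hub is installed; the filename
# model.pte follows the link above and may change if the repo layout changes.
from huggingface_hub import hf_hub_download

pte_path = hf_hub_download("pytorch/Phi-4-mini-instruct-INT8-INT4", "model.pte")
print(pte_path)
```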
@@ -120,10 +120,10 @@ linear_config = Int8DynamicActivationIntxWeightConfig(
      weight_scale_dtype=torch.bfloat16,
  )
  quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
- quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
+ quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])

  # either use `untied_model_id` or `untied_model_local_path`
- quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
+ quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  # Push to hub
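For context, a hedged sketch of how the changed lines fit into the surrounding quantization snippet. The torchao names and parameters (`IntxWeightOnlyConfig`, `PerAxis`, `PerGroup`, `weight_granularity`) are assumptions that may vary across torchao versions, and `untied_model_id` stands in for an untied-embedding copy of the base model:

```python
# Hedged sketch of the full quantization setup around the changed lines above.
# torchao parameter names are assumptions based on recent torchao releases and
# may differ by version; untied_model_id is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    ModuleFqnToConfig,
)

model_id = "microsoft/Phi-4-mini-instruct"
untied_model_id = "your-username/Phi-4-mini-instruct-untied"  # hypothetical

# int8 weight-only quantization for the (untied) embedding table
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)
# int8 dynamic activations + int4 grouped weights for the linear layers
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_scale_dtype=torch.bfloat16,
)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(
    quant_type=quant_config,
    include_input_output_embeddings=True,
    modules_to_not_convert=[],
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    untied_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```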
@@ -212,15 +212,15 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-INT8-INT4
  We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
  Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.

- We first convert the [quantized checkpoint](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4/blob/main/pytorch_model.bin) to one ExecuTorch's LLM export script expects by renaming some of the checkpoint keys.
- The following script does this for you.
+ ExecuTorch's LLM export scripts require the checkpoint keys and parameters to have certain names, which differ from those used in Hugging Face.
+ So we first use a conversion script that converts the Hugging Face checkpoint key names to the ones that ExecuTorch expects:
  ```Shell
- HF_MODEL_DIR=$(hf download pytorch/Phi-4-mini-instruct-INT8-INT4)
- python -m executorch.examples.models.phi_4_mini.convert_weights $HF_MODEL_DIR pytorch_model_converted.bin
+ python -m executorch.examples.models.phi_4_mini.convert_weights $(hf download pytorch/Phi-4-mini-instruct-INT8-INT4) pytorch_model_converted.bin
  ```

- Once the checkpoint is converted, we can export to ExecuTorch's pte format with the XNNPACK delegate.
- The below command exports with a max_seq_length/max_context_length of 1024, but it can be changed as desired.
+ Once we have the converted checkpoint, we export it to ExecuTorch with the XNNPACK backend as follows, using a max_seq_length/max_context_length of 1024.
+
+ (Note: ExecuTorch's LLM export script requires config.json to have certain key names. The correct config to use for the LLM export script is located at examples/models/phi_4_mini/config/config.json within the ExecuTorch repo.)

  ```Shell
  python -m executorch.examples.models.llama.export_llama \
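For reference, a hedged sketch of resolving the same checkpoint directory from Python with `huggingface_hub`, which is what the `$(hf download ...)` sub-shell in the conversion command above produces:

```python
# Minimal sketch: snapshot_download resolves the local directory of the
# quantized checkpoint, i.e. the value produced by $(hf download ...) above.
# Pass this directory to executorch.examples.models.phi_4_mini.convert_weights.
from huggingface_hub import snapshot_download

hf_model_dir = snapshot_download("pytorch/Phi-4-mini-instruct-INT8-INT4")
print(hf_model_dir)
```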
 