---
license: mit
pipeline_tag: text-generation
tags: [ONNX, DML, ONNXRuntime, phi3, nlp, conversational, custom_code]
inference: false
---
# Phi-3 Mini-4K-Instruct ONNX models
This repository hosts the optimized versions of [Phi-3-mini-4k-instruct](https://aka.ms/phi3-mini-4k-instruct) to accelerate inference with ONNX Runtime.
Phi-3 Mini is a lightweight, state-of-the-art open model built upon the datasets used for Phi-2 (synthetic data and filtered websites), with a focus on very high-quality, reasoning-dense data. The model belongs to the Phi-3 family, and the Mini version comes in two variants, 4K and 128K, which refer to the context length (in tokens) each can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
Optimized Phi-3 Mini models are published here in [ONNX](https://onnx.ai) format to run with [ONNX Runtime](https://onnxruntime.ai/) on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets.
[DirectML](https://aka.ms/directml) support lets developers bring hardware acceleration to Windows devices at scale across AMD, Intel, and NVIDIA GPUs. Along with DirectML, ONNX Runtime provides cross-platform support for Phi-3 Mini across a range of CPU, GPU, and mobile devices.
To easily get started with Phi-3, you can use our newly introduced ONNX Runtime Generate() API. See [here](https://aka.ms/generate-tutorial) for instructions on how to run it.
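The snippet below is a minimal sketch of that Generate() API in Python using the `onnxruntime-genai` package (install with `pip install onnxruntime-genai`, or the `-cuda` / `-directml` builds for GPU). The model path is a placeholder, and exact method names can differ between package releases, so treat the linked tutorial as the authoritative reference.

```python
import onnxruntime_genai as og

# Placeholder path: point this at the folder of the ONNX variant you downloaded.
model = og.Model("path/to/phi3-mini-4k-onnx")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Single-turn prompt in the Phi-3 chat template.
prompt = "<|user|>\nWhat is the capital of France?<|end|>\n<|assistant|>\n"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    # Decode and print each new token as it is produced.
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```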
## ONNX Models
Here are some of the optimized configurations we have added:
1. ONNX model for int4 DML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using [AWQ](https://arxiv.org/abs/2306.00978).
2. ONNX model for fp16 CUDA: ONNX model for NVIDIA GPUs using float16 precision.
3. ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via RTN.
4. ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile devices using int4 quantization via RTN. Two versions are uploaded to balance latency and accuracy: Acc=1 targets improved accuracy, while Acc=4 targets improved performance. For mobile devices, we recommend the model with acc-level-4.
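Each configuration lives in its own subfolder of this repository, so you typically only need the files for one variant. The sketch below shows one way to fetch a single subfolder with `huggingface_hub`; the repository ID and folder pattern are assumptions for illustration, so check the repo's file listing for the exact directory names.

```python
from huggingface_hub import snapshot_download

# Repo ID and folder pattern are illustrative; verify them against the repo's file listing.
snapshot_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-onnx",
    allow_patterns=["cpu_and_mobile/*acc-level-4*"],  # e.g. the int4 CPU/mobile variant
    local_dir="phi3-mini-4k-onnx",
)
```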
## Hardware Supported
The models are tested on:
- GPU SKU: RTX 4090 (DirectML)
- GPU SKU: 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)
- CPU SKU: Standard F64s v2 (64 vcpus, 128 GiB memory)
- Mobile SKU: Samsung Galaxy S21
Minimum Configuration Required:
- Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
- CUDA: NVIDIA GPU with [Compute Capability](https://developer.nvidia.com/cuda-gpus) >= 7.0
### Model Description
- **Developed by:** Microsoft
- **Model type:** ONNX
- **Language(s) (NLP):** Python, C, C++
- **License:** MIT
- **Model Description:** This is a conversion of the Phi-3 Mini-4K-Instruct model for ONNX Runtime inference.
## Additional Details
- [**ONNX Runtime Optimizations Blog Link**](https://aka.ms/phi3-optimizations)
- [**Phi-3 Model Blog Link**](https://aka.ms/phi3blog-april)
- [**Phi-3 Model Card**](https://aka.ms/phi3-mini-4k-instruct)
- [**Phi-3 Technical Report**](https://aka.ms/phi3-tech-report)
## Performance Metrics
### DirectML
We measured the performance of DirectML on an AMD Ryzen 9 7940HS with Radeon 780M graphics.
| Prompt Length | Generation Length | Average Throughput (tokens/sec) |
|---------------------------|-------------------|-----------------------------|
| 128 | 128 | 53.46686 |
| 128 | 256 | 53.11233 |
| 128 | 512 | 57.45816 |
| 128 | 1024 | 33.44713 |
| 256 | 128 | 76.50182 |
| 256 | 256 | 66.68873 |
| 256 | 512 | 70.83862 |
| 256 | 1024 | 34.64715 |
| 512 | 128 | 85.10079 |
| 512 | 256 | 68.64049 |
| 512 | 512 | - |
| 512 | 1024 | - |
| 1024 | 128 | - |
| 1024 | 256 | - |
| 1024 | 512 | - |
| 1024 | 1024 | - |
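Throughput here is the number of generated tokens divided by end-to-end generation time. The sketch below illustrates one way such a measurement could be scripted with `onnxruntime-genai`; it is not the harness used to produce the table above, and the API details are assumptions that may vary by release.

```python
import time
import onnxruntime_genai as og

def measure_throughput(model_dir: str, prompt_tokens: int = 128, gen_tokens: int = 128) -> float:
    """Rough tokens-per-second estimate for one prompt-length/generation-length pair."""
    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)

    # Build a synthetic prompt of roughly the requested length.
    tokens = tokenizer.encode("Hello " * prompt_tokens)[:prompt_tokens]

    params = og.GeneratorParams(model)
    params.set_search_options(min_length=prompt_tokens + gen_tokens,
                              max_length=prompt_tokens + gen_tokens)
    params.input_ids = tokens

    generator = og.Generator(model, params)
    start = time.perf_counter()
    generated = 0
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        generated += 1
    return generated / (time.perf_counter() - start)
```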
## Contributors
Sim Sze Yu