---
license: mit
pipeline_tag: text-generation
tags:
- ONNX
- DML
- ONNXRuntime
- phi3
- nlp
- conversational
- custom_code
inference: false
---
# Phi-3 Mini-4K-Instruct ONNX models
This repository hosts the optimized versions of Phi-3-mini-4k-instruct to accelerate inference with ONNX Runtime.
Phi-3 Mini is a lightweight, state-of-the-art open model built upon the datasets used for Phi-2 (synthetic data and filtered websites), with a focus on very high-quality, reasoning-dense data. The model belongs to the Phi-3 family, and the Mini version comes in two variants, 4K and 128K, which denote the context length (in tokens) each can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
Optimized Phi-3 Mini models are published here in ONNX format to run with ONNX Runtime on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets.
DirectML support lets developers bring hardware acceleration to Windows devices at scale across AMD, Intel, and NVIDIA GPUs. Along with DirectML, ONNX Runtime provides cross-platform support for Phi-3 Mini across a range of CPU, GPU, and mobile devices.
To easily get started with Phi-3, you can use our newly introduced ONNX Runtime Generate() API. See here for instructions on how to run it.
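As a minimal sketch of the Generate() API flow (the model directory path is an assumption, and the `onnxruntime-genai` API surface has changed across releases, so check the version you install):

```python
def generate(model_dir: str, prompt: str, max_length: int = 256) -> str:
    """Greedy, token-by-token generation with the ONNX Runtime Generate() API.

    Assumes `pip install onnxruntime-genai` and that `model_dir` points at a
    downloaded ONNX model folder (one of the variants listed below).
    """
    # Deferred import so this sketch can be inspected without the package installed.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)

    # Phi-3 chat template from the upstream model card.
    full_prompt = f"<|user|>\n{prompt} <|end|>\n<|assistant|>"

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(full_prompt))

    tokens = []
    while not generator.is_done():
        generator.generate_next_token()
        tokens.append(generator.get_next_tokens()[0])
    return tokenizer.decode(tokens)
```

Older releases of `onnxruntime-genai` fed the prompt through `params.input_ids` instead of `Generator.append_tokens`; consult the release notes for the exact calls your version supports.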
## ONNX Models
Here are some of the optimized configurations we have added:
- ONNX model for int4 DML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using AWQ.
- ONNX model for fp16 CUDA: ONNX model for NVIDIA GPUs using float16 precision.
- ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via RTN.
- ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile using int4 quantization via RTN. Two versions are uploaded to balance latency vs. accuracy: acc-level-1 targets improved accuracy, while acc-level-4 targets improved performance. For mobile devices, we recommend the model with acc-level-4.
## Hardware Supported
The models are tested on:
- GPU SKU: RTX 4090 (DirectML)
- GPU SKU: 1x A100 80GB GPU (Standard_ND96amsr_A100_v4, CUDA)
- CPU SKU: Standard F64s v2 (64 vcpus, 128 GiB memory)
- Mobile SKU: Samsung Galaxy S21
Minimum Configuration Required:
- Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
- CUDA: NVIDIA GPU with Compute Capability >= 7.0
## Model Description
- Developed by: Microsoft
- Model type: ONNX
- Language(s) (NLP): Python, C, C++
- License: MIT
- Model Description: This is a conversion of the Phi-3 Mini-4K-Instruct model for ONNX Runtime inference.
## Additional Details

## Performance Metrics

### DirectML
We measured the performance of DirectML on an AMD Ryzen 9 7940HS with Radeon 78 graphics.
| Prompt Length | Generation Length | Average Throughput (tps) |
|---|---|---|
| 128 | 128 | 53.46686 |
| 128 | 256 | 53.11233 |
| 128 | 512 | 57.45816 |
| 128 | 1024 | 33.44713 |
| 256 | 128 | 76.50182 |
| 256 | 256 | 66.68873 |
| 256 | 512 | 70.83862 |
| 256 | 1024 | 34.64715 |
| 512 | 128 | 85.10079 |
| 512 | 256 | 68.64049 |
| 512 | 512 | - |
| 512 | 1024 | - |
| 1024 | 128 | - |
| 1024 | 256 | - |
| 1024 | 512 | - |
| 1024 | 1024 | - |
## Contributors
Sim Sze Yu