ssyok committed on
Commit 99a982d · verified · 1 Parent(s): e034964

Update README.md


temp create a sample readme

Files changed (1)
  1. README.md +86 -3
README.md CHANGED
@@ -1,3 +1,86 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: text-generation
+ tags: [ONNX, DML, ONNXRuntime, phi3, nlp, conversational, custom_code]
+ inference: false
+ ---
+
+ # Phi-3 Mini-4K-Instruct ONNX models
+
+ <!-- Provide a quick summary of what the model is/does. -->
+ This repository hosts optimized versions of [Phi-3-mini-4k-instruct](https://aka.ms/phi3-mini-4k-instruct) to accelerate inference with ONNX Runtime.
+
+ Phi-3 Mini is a lightweight, state-of-the-art open model built upon the datasets used for Phi-2 - synthetic data and filtered websites - with a focus on very high-quality, reasoning-dense data. The model belongs to the Phi-3 model family, and the mini version comes in two variants, 4K and 128K, which refer to the context lengths (in tokens) they can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
+
+ Optimized Phi-3 Mini models are published here in [ONNX](https://onnx.ai) format to run with [ONNX Runtime](https://onnxruntime.ai/) on CPU and GPU across devices, including server platforms, Windows, Linux, and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets.
+
+ [DirectML](https://aka.ms/directml) support lets developers bring hardware acceleration to Windows devices at scale across AMD, Intel, and NVIDIA GPUs. Along with DirectML, ONNX Runtime provides cross-platform support for Phi-3 Mini across a range of CPU, GPU, and mobile devices.
+
+ To easily get started with Phi-3, you can use our newly introduced ONNX Runtime Generate() API. See [here](https://aka.ms/generate-tutorial) for instructions on how to run it.
+
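+ As a rough illustration, a minimal generation loop with the onnxruntime-genai Python package might look like the sketch below. The model folder and prompt are assumptions, and the exact API surface differs between onnxruntime-genai releases, so treat the linked tutorial as the authoritative reference.
+
+ ```python
+ # Minimal sketch, not the official sample: assumes `pip install onnxruntime-genai`
+ # (or the -directml / -cuda flavor) and a locally downloaded model folder.
+ import onnxruntime_genai as og
+
+ model = og.Model("Phi-3-mini-4k-instruct-onnx/cpu-int4")   # hypothetical local path
+ tokenizer = og.Tokenizer(model)
+
+ # Phi-3 chat-style prompt; adjust to your use case.
+ prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
+
+ params = og.GeneratorParams(model)
+ params.set_search_options(max_length=256)
+ params.input_ids = tokenizer.encode(prompt)   # older releases; newer ones pass tokens to the generator instead
+
+ generator = og.Generator(model, params)
+ while not generator.is_done():
+     generator.compute_logits()
+     generator.generate_next_token()
+
+ print(tokenizer.decode(generator.get_sequence(0)))
+ ```
+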
+ ## ONNX Models
+
+ Here are some of the optimized configurations we have added (a sketch for fetching a single variant follows the list):
+
+ 1. ONNX model for int4 DML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using [AWQ](https://arxiv.org/abs/2306.00978).
+ 2. ONNX model for fp16 CUDA: ONNX model you can use to run on NVIDIA GPUs.
+ 3. ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via RTN.
+ 4. ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile using int4 quantization via RTN. There are two versions uploaded to balance latency vs. accuracy: Acc=1 is targeted at improved accuracy, while Acc=4 is for improved performance. For mobile devices, we recommend using the model with acc-level-4.
+
+
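+ Rather than cloning the whole repository, a single variant could be fetched with the huggingface_hub library as sketched below; the repo id and folder pattern are illustrative and should be replaced with the actual folder names in this repository.
+
+ ```python
+ # Minimal sketch: assumes `pip install huggingface_hub`; repo_id and allow_patterns are assumptions.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="microsoft/Phi-3-mini-4k-instruct-onnx",   # hypothetical source repo
+     allow_patterns=["directml/*"],                     # hypothetical folder for the int4 DML variant
+     local_dir="Phi-3-mini-4k-instruct-onnx",
+ )
+ ```
+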
+ ## Hardware Supported
+
+ The models are tested on:
+ - GPU SKU: RTX 4090 (DirectML)
+ - GPU SKU: 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)
+ - CPU SKU: Standard F64s v2 (64 vcpus, 128 GiB memory)
+ - Mobile SKU: Samsung Galaxy S21
+
+ Minimum Configuration Required:
+ - Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
+ - CUDA: NVIDIA GPU with [Compute Capability](https://developer.nvidia.com/cuda-gpus) >= 7.0
+
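+ Which of these targets is available at runtime can be checked from Python as sketched below; the providers listed depend on which ONNX Runtime package flavor is installed (e.g. onnxruntime, onnxruntime-directml, or onnxruntime-gpu).
+
+ ```python
+ # Minimal sketch: lists the execution providers the installed ONNX Runtime build can use.
+ import onnxruntime as ort
+
+ providers = ort.get_available_providers()
+ print(providers)
+ # Expect "DmlExecutionProvider" on a DirectML build, "CUDAExecutionProvider" on a CUDA build,
+ # and "CPUExecutionProvider" everywhere.
+ ```
+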
+ ### Model Description
+
+ - **Developed by:** Microsoft
+ - **Model type:** ONNX
+ - **Language(s) (NLP):** Python, C, C++
+ - **License:** MIT
+ - **Model Description:** This is a conversion of the Phi-3 Mini-4K-Instruct model for ONNX Runtime inference.
+
+ ## Additional Details
+ - [**ONNX Runtime Optimizations Blog Link**](https://aka.ms/phi3-optimizations)
+ - [**Phi-3 Model Blog Link**](https://aka.ms/phi3blog-april)
+ - [**Phi-3 Model Card**](https://aka.ms/phi3-mini-4k-instruct)
+ - [**Phi-3 Technical Report**](https://aka.ms/phi3-tech-report)
+
+
+ ## Performance Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+ ### DirectML
+ We measured the performance of DirectML on an AMD Ryzen 9 7940HS w/ Radeon 78
+
+ | Batch Size, Prompt Length | Generation Length | Average Throughput (tps) |
+ |---------------------------|-------------------|-----------------------------|
+ | 1, 128 | 128 | |
+ | 1, 128 | 256 | |
+ | 1, 128 | 512 | |
+ | 1, 128 | 1024 | |
+ | 1, 256 | 128 | |
+ | 1, 256 | 256 | |
+ | 1, 256 | 512 | |
+ | 1, 256 | 1024 | |
+ | 1, 512 | 128 | |
+ | 1, 512 | 256 | |
+ | 1, 512 | 512 | - |
+ | 1, 512 | 1024 | - |
+ | 1, 1024 | 128 | - |
+ | 1, 1024 | 256 | - |
+ | 1, 1024 | 512 | - |
+ | 1, 1024 | 1024 | - |
+
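+ As a rough sketch (not the exact harness used for the table above), throughput in tokens per second could be estimated with the Generate() API as follows; the model path is an assumption and the API details may differ between onnxruntime-genai releases.
+
+ ```python
+ # Minimal sketch of a tokens-per-second measurement for a fixed prompt/generation length.
+ import time
+ import onnxruntime_genai as og
+
+ model = og.Model("Phi-3-mini-4k-instruct-onnx/directml-int4")   # hypothetical local path
+ tokenizer = og.Tokenizer(model)
+
+ prompt_tokens = tokenizer.encode("hello " * 128)                 # roughly 128-token synthetic prompt
+ params = og.GeneratorParams(model)
+ params.set_search_options(max_length=len(prompt_tokens) + 128)   # prompt + 128 generated tokens
+ params.input_ids = prompt_tokens
+
+ generator = og.Generator(model, params)
+ start, generated = time.time(), 0
+ while not generator.is_done():
+     generator.compute_logits()
+     generator.generate_next_token()
+     generated += 1
+
+ print(f"{generated / (time.time() - start):.1f} tokens/sec")
+ ```
+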
+
+ ## Contributors
+ Sim Sze Yu