Update README.md
README.md (CHANGED)
@@ -186,19 +186,6 @@ and use a token with write access, from https://huggingface.co/settings/tokens
 
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
-Need to install lm-eval from source:
-https://github.com/EleutherAI/lm-evaluation-harness#install
-
-
-## baseline
-```Shell
-lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
-```
-
-## float8 dynamic activation and float8 weight quantization (float8dq)
-```Shell
-lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
-```
 
 | Benchmark |  |  |
 |----------------------------------|----------------|-------------------------------|
@@ -222,6 +209,25 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 | mathqa (0-shot) | 42.31 | 42.51 |
 | **Overall** | **55.35** | **55.11** |
 
+<details>
+<summary> Reproduce Model Quality Results </summary>
+
+Need to install lm-eval from source:
+https://github.com/EleutherAI/lm-evaluation-harness#install
+
+
+## baseline
+```Shell
+lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
+```
+
+## float8 dynamic activation and float8 weight quantization (float8dq)
+```Shell
+lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
+```
+</details>
+
+
 # Peak Memory Usage
 
 
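The lm_eval invocations above use the CLI. When scripting comparisons across checkpoints, lm-eval also exposes a Python entry point; below is a minimal sketch, assuming lm-eval 0.4+ installed from source as linked above, mirroring the hellaswag task and batch size of the CLI calls:

```Python
# Minimal sketch: run the same hellaswag evaluation through lm-eval's Python API.
# Assumes lm-eval >= 0.4 installed from source (see the install link above).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-float8dq",
    tasks=["hellaswag"],
    batch_size=8,
    device="cuda:0",
)
# Per-task metrics (acc, acc_norm, ...) live under results["results"].
print(results["results"]["hellaswag"])
```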
@@ -233,7 +239,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 | Peak Memory (GB) | 8.91 | 5.70 (36% reduction) |
 
 
-
+<details>
+<summary> Reproduce Peak Memory Usage Results </summary>
 
 We can use the following code to get a sense of peak memory usage during inference:
 
@@ -278,6 +285,8 @@ mem = torch.cuda.max_memory_reserved() / 1e9
 print(f"Peak Memory Usage: {mem:.02f} GB")
 ```
 
+</details>
+
 # Model Performance
 
 ## Results (H100 machine)
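Only the last two lines of the peak-memory script are visible in the hunk above. For orientation, here is a rough sketch of the measurement pattern it follows; the prompt, generation length, and loading arguments are illustrative assumptions rather than the README's exact script, and loading the float8dq checkpoint assumes torchao is installed alongside a recent transformers:

```Python
# Rough sketch of a peak-memory measurement; not the README's exact script.
# Assumes torchao is installed so the float8dq checkpoint can be loaded.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-float8dq"  # use microsoft/Phi-4-mini-instruct for the baseline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)

torch.cuda.reset_peak_memory_stats()  # measure from a clean slate

inputs = tokenizer("What are the benefits of float8 quantization?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

mem = torch.cuda.max_memory_reserved() / 1e9  # same readout the script prints
print(f"Peak Memory Usage: {mem:.02f} GB")
```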
@@ -291,6 +300,9 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 
 Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
 
+<details>
+<summary> Reproduce Model Performance Results </summary>
+
 ## Setup
 Get vllm source code:
 ```Shell
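The reported latency and serving numbers come from the benchmark_latency and benchmark_serving scripts in vLLM's benchmarks/ directory, set up in the hunks above and below. As a quick informal sanity check that the quantized checkpoint loads and generates under vLLM at all, an offline call like the following can help; it times a single request, which is not comparable to the benchmark results, and whether the float8dq checkpoint loads depends on the installed vllm/torchao versions:

```Python
# Informal single-request timing with vLLM's offline API; not benchmark_latency.py.
# Assumes vllm is installed and this build supports the float8dq checkpoint.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq")
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(["Explain float8 quantization in one paragraph."], params)
elapsed = time.perf_counter() - start

n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"Generated {n_tokens} tokens in {elapsed:.2f} s")
```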
@@ -351,6 +363,7 @@ Client:
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
 ```
 
+</details>
 
 
 # Disclaimer