jerryzh168 committed on
Commit dd89b44 · verified · 1 Parent(s): 700e10a

Update README.md

Files changed (1)
  1. README.md +0 -38
README.md CHANGED
@@ -295,11 +295,8 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 | | Phi-4 mini-Ins | Phi-4-mini-instruct-FP8 |
 | latency (batch_size=1) | 1.61s | 1.25s (1.29x speedup) |
 | latency (batch_size=256) | 5.16s | 4.89s (1.05x speedup) |
- | serving (num_prompts=1) | 1.37 req/s | 1.66 req/s (1.21x speedup) |
- | serving (num_prompts=1000) | 62.55 req/s | 72.56 req/s (1.16x speedup) |
 
 Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
- Note the result is not using fbgemm kernels, (no `fbgemm-gpu-genai` installed), fbgemm kernels has less speedup when num_prompts is 1000 currently.
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
@@ -334,41 +331,6 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-FP8 --batch-size 1
 ```
 
- ## benchmark_serving
-
- We benchmarked the throughput in a serving environment.
-
- Download sharegpt dataset:
-
- ```Shell
- wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
- ```
-
- Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
-
- Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmark_serving` script.
- ### baseline
- Server:
- ```Shell
- vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
- ```
-
- Client:
- ```Shell
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
- ```
-
- ### FP8
- Server:
- ```Shell
- VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3
- ```
-
- Client:
- ```Shell
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-FP8 --num-prompts 1
- ```
-
 </details>
 
 # Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization
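
For reference, the `num_prompts=1000` serving row in the table above presumably comes from the same `benchmark_serving.py` client shown in the removed instructions, only with a higher `--num-prompts` value (the removed note states the prompt count is controlled by that flag). A minimal sketch of that FP8 client run, assuming the FP8 server and the ShareGPT dataset are already set up as described in the removed section:

```Shell
# Sketch: client command for the num_prompts=1000 FP8 serving measurement
# (assumes the FP8 vllm server is running and ShareGPT_V3_unfiltered_cleaned_split.json is downloaded)
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-FP8 --num-prompts 1000
```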
 