jerryzh168 committed (verified)
Commit fa3b0b5 · Parent(s): 3bd3f11

Update README.md

Files changed (1): README.md (+11, -2)
README.md CHANGED
@@ -278,11 +278,20 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 
 Our INT4 model is optimized only for batch size 1, so expect some slowdown at larger batch sizes. We expect it to be used in local server deployments serving one or a few users, where decode tokens per second matters more than time to first token.
 
-## Results (A100 machine)
+## Results
+
+A100
 | Benchmark (Latency) | | |
 |----------------------------------|----------------|--------------------------|
 | | Qwen3-8B | Qwen3-8B-INT4 |
-| latency (batch_size=1) | 3.52s | 2.84s (1.24x speedup) |
+| latency (batch_size=1) | 3.47s | 2.93s (1.18x speedup) |
+
+H100
+| Benchmark (Latency) | | |
+|----------------------------------|----------------|--------------------------|
+| | Qwen3-8B | Qwen3-8B-INT4 |
+| latency (batch_size=1) | 2.46s | 1.40s (1.76x speedup) |
+
 
 INT4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
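The commit does not include the benchmark script behind these tables, so as a minimal sketch (the helper names `measure_latency` and `speedup` are hypothetical, not from the repo): single-batch latency is typically an averaged wall-clock timing of a generation call after warmup, and the speedup column is just the baseline latency divided by the quantized latency.

```python
import time

def measure_latency(generate_fn, warmup=1, iters=3):
    """Average wall-clock seconds per call, after warmup passes.

    `generate_fn` stands in for something like `lambda: model.generate(...)`.
    """
    for _ in range(warmup):
        generate_fn()
    start = time.perf_counter()
    for _ in range(iters):
        generate_fn()
    return (time.perf_counter() - start) / iters

def speedup(baseline_s, quantized_s):
    """Speedup column of the tables: baseline latency / INT4 latency."""
    return round(baseline_s / quantized_s, 2)

# Sanity-check the speedup figures against the latency columns above:
# A100: 3.47 / 2.93 -> 1.18x, H100: 2.46 / 1.40 -> 1.76x
print(speedup(3.47, 2.93), speedup(2.46, 1.40))  # → 1.18 1.76
```

Note the larger gain on H100, consistent with the tables: INT4 halves the weight traffic, and batch-size-1 decode is memory-bandwidth bound.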