Update README.md

README.md
Our INT4 model is only optimized for batch size 1, so expect some slowdown with larger batch sizes. We expect it to be used in local server deployments serving one or a few users, where decode tokens per second matters more than time to first token.

## Results

A100

| Benchmark (Latency)    | Qwen3-8B | Qwen3-8B-INT4         |
|------------------------|----------|-----------------------|
| latency (batch_size=1) | 3.47s    | 2.93s (1.18x speedup) |

H100

| Benchmark (Latency)    | Qwen3-8B | Qwen3-8B-INT4         |
|------------------------|----------|-----------------------|
| latency (batch_size=1) | 2.46s    | 1.40s (1.76x speedup) |

INT4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
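The speedups above are simple wall-clock ratios. As a minimal sketch of how such batch-size-1 latency numbers can be measured and compared, the helper below times a callable over several runs after a warmup; the `time.sleep` stand-ins are hypothetical placeholders for real Qwen3-8B and Qwen3-8B-INT4 `generate` calls, used only to keep the sketch self-contained:

```python
import time

def measure_latency(generate, n_warmup=2, n_runs=5):
    """Average wall-clock time of a batch-size-1 generate() call."""
    for _ in range(n_warmup):  # warm up caches / compilation before timing
        generate()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate()
    return (time.perf_counter() - start) / n_runs

# Hypothetical stand-ins for the real model calls; the README's numbers
# come from actual Qwen3-8B runs, not from these sleeps.
baseline = measure_latency(lambda: time.sleep(0.02))
int4 = measure_latency(lambda: time.sleep(0.01))
print(f"speedup: {baseline / int4:.2f}x")
```

In a real benchmark, the warmup runs matter: the first call often pays one-time costs (weight loading, kernel compilation) that would otherwise inflate the reported latency.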