Update README.md

README.md
Our INT4 model is only optimized for batch size 1, so expect some slowdown with larger batch sizes. We expect it to be used in local server deployments serving one or a few users, where decode tokens per second matters more than time to first token.

## Results

A100

| Benchmark (Latency)    | Qwen3-8B | Qwen3-8B-INT4         |
|------------------------|----------|-----------------------|
| latency (batch_size=1) | 3.47s    | 2.93s (1.18x speedup) |

H100

| Benchmark (Latency)    | Qwen3-8B | Qwen3-8B-INT4         |
|------------------------|----------|-----------------------|
| latency (batch_size=1) | 2.46s    | 1.40s (1.76x speedup) |

INT4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
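The speedups above are simple wall-clock ratios. As a minimal sketch of how such batch-size-1 latency numbers can be measured and compared, the helper below times a callable over several runs after a warmup; the `time.sleep` stand-ins are hypothetical placeholders for real Qwen3-8B and Qwen3-8B-INT4 `generate` calls, used only to keep the sketch self-contained:

```python
import time

def measure_latency(generate, n_warmup=2, n_runs=5):
    """Average wall-clock time of a batch-size-1 generate() call."""
    for _ in range(n_warmup):  # warm up caches / compilation before timing
        generate()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate()
    return (time.perf_counter() - start) / n_runs

# Hypothetical stand-ins for the real model calls; the README's numbers
# come from actual Qwen3-8B runs, not from these sleeps.
baseline = measure_latency(lambda: time.sleep(0.02))
int4 = measure_latency(lambda: time.sleep(0.01))
print(f"speedup: {baseline / int4:.2f}x")
```

In a real benchmark, the warmup runs matter: the first call often pays one-time costs (weight loading, kernel compilation) that would otherwise inflate the reported latency.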