Upload README.md with huggingface_hub
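That commit title is the default message produced by the `huggingface_hub` upload helper. For reference, a minimal sketch of the call that typically creates such a commit; the repo id below is a placeholder, not taken from this page:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
api.upload_file(
    path_or_fileobj="README.md",          # local file to upload
    path_in_repo="README.md",             # destination path inside the repo
    repo_id="your-org/your-model-GGUF",   # placeholder repo id (assumption)
    commit_message="Upload README.md with huggingface_hub",
)
```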
README.md CHANGED
@@ -120,44 +120,73 @@ Here's the speed and quality evaluation on two nano benchmarks. The higher the b
(two result plots)

- #### Table 1: Tokens per Second
-
```
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 4096
llama_context: n_ubatch = 4096
llama_context: causal_attn = 1
- llama_context: flash_attn =
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
```

- | Quantization Type | File Size | BPW | … | … | … | … |
- |------------------|-----------|-----|--------------|--------------|--------------------|--------------------|
- | IQ1_S | 748.77 MiB | 2.04 | 1608 | 1618 | +53% | +49% |
- | IQ1_M | 804.97 MiB | 2.19 | 1553 | 1563 | +48% | +44% |
- | IQ2_XXS | 898.64 MiB | 2.44 | 1600 | 1612 | +52% | +49% |
- | IQ2_M | 1.06 GiB | 2.94 | 1529 | 1534 | +46% | +42% |
- | Q2_K | 1.18 GiB | 3.29 | 1459 | 1471 | +39% | +36% |
- | IQ3_XXS | 1.19 GiB | 3.31 | 1552 | 1487 | +48% | +37% |
- | IQ3_XS | 1.29 GiB | 3.59 | 1529 | 1526 | +46% | +41% |
- | IQ3_S | 1.35 GiB | 3.76 | 1520 | 1516 | +45% | +40% |
- | IQ3_M | 1.38 GiB | 3.84 | 1507 | 1511 | +44% | +40% |
- | Q3_K_M | 1.48 GiB | 4.11 | 1475 | 1487 | +40% | +37% |
- | IQ4_NL | 1.69 GiB | 4.72 | 1464 | 1469 | +39% | +36% |
- | IQ4_XS | 1.61 GiB | 4.49 | 1478 | 1487 | +41% | +37% |
- | Q4_K_M | 1.79 GiB | 4.99 | 1454 | 1458 | +38% | +35% |
- | Q5_K_S | 2.02 GiB | 5.61 | 1419 | 1429 | +35% | +32% |
- | Q5_K_M | 2.07 GiB | 5.75 | 1404 | 1433 | +34% | +32% |
- | Q6_K | 2.36 GiB | 6.56 | 1356 | 1382 | +29% | +28% |
- | Q8_0 | 3.05 GiB | 8.50 | 1304 | 1334 | +24% | +23% |
- | F16 | 5.75 GiB | 16.00 | 1050 | 1083 | +0% | +0% |

#### Table 2: NDCG@5
## NDCG@5 Performance Comparison

(two result plots)

+ #### Table 1: Tokens per Second on NanoHotpotQA Documents
+
+
+ | Quantization Type | File Size | BPW | Peak VRAM | Token/s | Δ to F16 |
+ |------------------|-----------|-----|-----------|---------|----------|
+ | IQ1_S | 748.77 MiB | 2.04 | 3651MB | 3625 | +7% |
+ | IQ1_M | 804.97 MiB | 2.19 | 3799MB | 3349 | -1% |
+ | IQ2_XXS | 898.64 MiB | 2.44 | 3799MB | 3701 | +9% |
+ | IQ2_M | 1.06 GiB | 2.94 | 3983MB | 3407 | +0% |
+ | Q2_K | 1.18 GiB | 3.29 | 4113MB | 3173 | -7% |
+ | IQ3_XXS | 1.19 GiB | 3.31 | 4119MB | 3668 | +8% |
+ | IQ3_XS | 1.29 GiB | 3.59 | 4221MB | 3604 | +6% |
+ | IQ3_S | 1.35 GiB | 3.76 | 4283MB | 3599 | +6% |
+ | IQ3_M | 1.38 GiB | 3.84 | 4315MB | 3603 | +6% |
+ | Q3_K_M | 1.48 GiB | 4.11 | 4411MB | 3450 | +2% |
+ | IQ4_NL | 1.69 GiB | 4.72 | 4635MB | 3571 | +5% |
+ | IQ4_XS | 1.61 GiB | 4.49 | 4553MB | 3585 | +5% |
+ | Q4_K_M | 1.79 GiB | 4.99 | 4735MB | 3558 | +5% |
+ | Q5_K_S | 2.02 GiB | 5.61 | 4963MB | 3567 | +5% |
+ | Q5_K_M | 2.07 GiB | 5.75 | 5017MB | 3528 | +4% |
+ | Q6_K | 2.36 GiB | 6.56 | 5315MB | 3334 | -2% |
+ | Q8_0 | 3.05 GiB | 8.50 | 6027MB | 3767 | +11% |
+ | F16 | 5.75 GiB | 16.00 | 9939MB | 3399 | +0% |
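The Δ to F16 column in the new table is consistent with the throughput change relative to the F16 row (3399 tokens/s), rounded to a whole percent. A quick sanity check, with a few rows hard-coded from the table above:

```python
# Δ to F16 = throughput change vs. the F16 baseline, rounded to whole percent.
F16_TPS = 3399

def delta_to_f16(tps: int) -> str:
    return f"{round((tps / F16_TPS - 1) * 100):+d}%"

assert delta_to_f16(3625) == "+7%"   # IQ1_S
assert delta_to_f16(3173) == "-7%"   # Q2_K
assert delta_to_f16(3767) == "+11%"  # Q8_0
```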
+
+
+ System info:
```
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 36 repeating layers to GPU
+ load_tensors: offloading output layer to GPU
+ load_tensors: offloaded 37/37 layers to GPU
+ load_tensors: CUDA0 model buffer size = 3127.61 MiB
+ load_tensors: CPU_Mapped model buffer size = 315.30 MiB
+ ...................................................................................
+ llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 4096
llama_context: n_ubatch = 4096
llama_context: causal_attn = 1
+ llama_context: flash_attn = 1
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
+ llama_context: CUDA_Host output buffer size = 0.59 MiB
+ llama_kv_cache_unified: CUDA0 KV buffer size = 144.00 MiB
+ llama_kv_cache_unified: size = 144.00 MiB ( 4096 cells, 36 layers, 1/1 seqs), K (f16): 72.00 MiB, V (f16): 72.00 MiB
+ llama_context: CUDA0 compute buffer size = 2470.16 MiB
+ llama_context: CUDA_Host compute buffer size = 96.17 MiB
+ llama_context: graph nodes = 1234
+ llama_context: graph splits = 2
+ common_init_from_params: added <|endoftext|> logit bias = -inf
+ common_init_from_params: added <|im_end|> logit bias = -inf
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
+ common_init_from_params: added <|repo_name|> logit bias = -inf
+ common_init_from_params: added <|file_sep|> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ main: n_tokens in batch = 0
+ main: number of embeddings = 5090
```
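The KV-cache numbers in the log check out: 4096 cells times 36 layers times 2 bytes (f16) per value matches the reported 72.00 MiB per side only if each layer stores 256 K (and 256 V) values per cell. That width is not printed in the log, so treat it as inferred:

```python
# Sanity check of the llama_kv_cache_unified sizes from the log above.
cells, layers, f16_bytes = 4096, 36, 2
kv_width = 256  # per-layer K (or V) values per cell; inferred, not printed

per_side = cells * layers * kv_width * f16_bytes  # bytes for K (or V) alone
assert per_side == 72 * 1024**2        # K (f16): 72.00 MiB
assert 2 * per_side == 144 * 1024**2   # total KV buffer: 144.00 MiB
```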

+

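The `main: number of embeddings = 5090` line indicates the run embedded the benchmark's documents in a single pass. For reference, a rough way to drive a GGUF model in embedding mode from Python via the llama-cpp-python bindings; this is an assumption on my part, since the log above comes from a llama.cpp CLI tool, and the model filename is a placeholder:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q8_0.gguf",  # placeholder GGUF filename
    embedding=True,                # embedding mode, as in the run above
    n_ctx=4096,
    n_gpu_layers=-1,               # offload everything, as in "offloaded 37/37 layers"
)
vecs = llm.embed(["first document", "second document"])
print(len(vecs), len(vecs[0]))  # number of documents, embedding dimension
```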
#### Table 2: NDCG@5
## NDCG@5 Performance Comparison
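For reference, NDCG@5 here is presumably the standard retrieval metric: the discounted cumulative gain of the top five retrieved documents, normalized by the gain of an ideal ordering. A minimal sketch that normalizes against the best reordering of the same relevance list (the benchmark's exact labeling scheme isn't stated on this page):

```python
import math

def ndcg_at_5(rels: list[float]) -> float:
    """rels: graded relevance of the top-5 retrieved docs, in rank order."""
    def dcg(xs):
        return sum(x / math.log2(i + 2) for i, x in enumerate(xs))
    ideal = dcg(sorted(rels, reverse=True)[:5])
    return dcg(rels[:5]) / ideal if ideal > 0 else 0.0

# One relevant document retrieved at rank 3:
print(ndcg_at_5([0, 0, 1, 0, 0]))  # 0.5, since 1/log2(4) = 0.5
```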