alexmarques committed · verified
Commit 2eb1501 · 1 Parent(s): 91597e9

Update README.md

Files changed (1):
  1. README.md +29 -27
README.md CHANGED
@@ -131,6 +131,8 @@ The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande an
  Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
  This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).

+ **Note:** Results have been updated after Meta modified the chat template.
+
  ### Accuracy

  #### Open LLM Leaderboard evaluation scores
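Since the note above ties the updated numbers to Meta's chat-template change, and every command below passes `--apply_chat_template`, it can help to inspect the template a checkpoint currently ships. A minimal sketch via transformers (the message content is an illustrative assumption; only the checkpoint name comes from this card):

```python
from transformers import AutoTokenizer

# Render one user turn through the checkpoint's bundled chat template to see
# roughly the prompt string the harness sends to vLLM. Illustrative only.
tokenizer = AutoTokenizer.from_pretrained(
    "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16"
)
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # special tokens (BOS, headers) come from the template itself
```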
@@ -148,9 +150,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  <tr>
  <td>MMLU (5-shot)
  </td>
- <td>69.43
+ <td>68.32
  </td>
- <td>69.37
+ <td>68.26
  </td>
  <td>99.9%
  </td>
@@ -158,21 +160,21 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  <tr>
  <td>MMLU (CoT, 0-shot)
  </td>
- <td>72.56
+ <td>72.83
  </td>
- <td>72.14
+ <td>72.44
  </td>
- <td>99.4%
+ <td>99.5%
  </td>
  </tr>
  <tr>
  <td>ARC Challenge (0-shot)
  </td>
- <td>81.57
+ <td>81.40
  </td>
- <td>81.48
+ <td>81.40
  </td>
- <td>99.9%
+ <td>100.0%
  </td>
  </tr>
  <tr>
  <tr>
@@ -180,49 +182,49 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
180
  </td>
181
  <td>82.79
182
  </td>
183
- <td>82.64
184
  </td>
185
- <td>99.8%
186
  </td>
187
  </tr>
188
  <tr>
189
  <td>Hellaswag (10-shot)
190
  </td>
191
- <td>80.01
192
  </td>
193
- <td>80.1
194
  </td>
195
- <td>100.3%
196
  </td>
197
  </tr>
198
  <tr>
199
  <td>Winogrande (5-shot)
200
  </td>
201
- <td>77.90
202
  </td>
203
- <td>77.27
204
  </td>
205
- <td>99.2%
206
  </td>
207
  </tr>
208
  <tr>
209
- <td>TruthfulQA (0-shot, mc2)
210
  </td>
211
- <td>54.04
212
  </td>
213
- <td>54.15
214
  </td>
215
- <td>100.2%
216
  </td>
217
  </tr>
218
  <tr>
219
  <td><strong>Average</strong>
220
  </td>
221
- <td><strong>74.04</strong>
222
  </td>
223
- <td><strong>73.89</strong>
224
  </td>
225
- <td><strong>99.8%</strong>
226
  </td>
227
  </tr>
228
  </table>
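For clarity on the third column: the recovery values read as the second score over the first. A minimal sketch of that arithmetic, assuming the first score column is the unquantized baseline and the second is the quantized model (the helper name is ours; values are from the updated rows above):

```python
def recovery(baseline: float, quantized: float) -> str:
    """Quantized score as a percentage of the unquantized baseline score."""
    return f"{100 * quantized / baseline:.1f}%"

# Values from the updated table above.
print(recovery(68.32, 68.26))  # 99.9% -> MMLU (5-shot) row
print(recovery(54.48, 54.04))  # 99.2% -> TruthfulQA (0-shot, mc2) row
```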
@@ -235,7 +237,7 @@ The results were obtained using the following commands:
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
@@ -247,7 +249,7 @@ lm_eval \
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks mmlu_cot_0shot_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
@@ -258,7 +260,7 @@ lm_eval \
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
  --tasks arc_challenge_llama_3.1_instruct \
  --apply_chat_template \
  --num_fewshot 0 \
@@ -269,7 +271,7 @@ lm_eval \
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_cot_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
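The commands above all run through the vLLM engine. For completeness, a minimal offline-inference sketch for the same checkpoint (the prompt and sampling settings are illustrative assumptions, not from the card; `max_model_len` mirrors the largest value used above):

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint named in the commands above.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
    max_model_len=4096,
    tensor_parallel_size=1,
)

# Greedy decoding for a reproducible spot check; settings are ours.
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Briefly explain weight-only INT8 quantization."], params)
print(outputs[0].outputs[0].text)
```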
 