nm-research committed
Commit 829c019 · verified · 1 Parent(s): 34b5056

Update README.md

Files changed (1)
  1. README.md +144 -33
README.md CHANGED
@@ -42,7 +42,7 @@ from transformers import AutoTokenizer
  from vllm import LLM, SamplingParams

  max_model_len, tp_size = 4096, 1
- model_name = "neuralmagic-ent/granite-3.1-2b-instruct-FP8-dynamic"
+ model_name = "neuralmagic/granite-3.1-2b-instruct-FP8-dynamic"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
  sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
@@ -65,7 +65,9 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do

  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

-
+ <details>
+ <summary>Model Creation Code</summary>
+
  ```bash
  python quantize.py --model_id ibm-granite/granite-3.1-2b-instruct --save_path "output_dir/"
  ```
@@ -110,28 +112,43 @@ def main():
  if __name__ == "__main__":
      main()
  ```
+ </details>

  ## Evaluation

- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:

+ <details>
+ <summary>Evaluation Commands</summary>
+
  OpenLLM Leaderboard V1:
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic-ent/granite-3.1-2b-instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
  ```
+ OpenLLM Leaderboard V2:
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/granite-3.1-2b-instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --tasks leaderboard \
+ --write_out \
+ --batch_size auto \
+ --output_path output_dir \
+ --show_config
+ ```

  #### HumanEval
  ##### Generation
  ```
  python3 codegen/generate.py \
- --model neuralmagic-ent/granite-3.1-2b-instruct-FP8-dynamic \
+ --model neuralmagic/granite-3.1-2b-instruct-FP8-dynamic \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
@@ -141,45 +158,130 @@ python3 codegen/generate.py \
  ##### Sanitization
  ```
  python3 evalplus/sanitize.py \
- humaneval/neuralmagic-ent--granite-3.1-2b-instruct-FP8-dynamic_vllm_temp_0.2
+ humaneval/neuralmagic--granite-3.1-2b-instruct-FP8-dynamic_vllm_temp_0.2
  ```
  ##### Evaluation
  ```
  evalplus.evaluate \
  --dataset humaneval \
- --samples humaneval/neuralmagic-ent--granite-3.1-2b-instruct-FP8-dynamic_vllm_temp_0.2-sanitized
+ --samples humaneval/neuralmagic--granite-3.1-2b-instruct-FP8-dynamic_vllm_temp_0.2-sanitized
  ```
+ </details>

  ### Accuracy

  #### OpenLLM Leaderboard V1 evaluation scores

- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-FP8-dynamic |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | ARC-Challenge (Acc-Norm, 25-shot) | 55.63 | 55.03 |
- | GSM8K (Strict-Match, 5-shot) | 60.96 | 61.49 |
- | HellaSwag (Acc-Norm, 10-shot) | 75.21 | 75.26 |
- | MMLU (Acc, 5-shot) | 54.38 | 54.24 |
- | TruthfulQA (MC2, 0-shot) | 55.93 | 55.42 |
- | Winogrande (Acc, 5-shot) | 69.67 | 69.61 |
- | **Average Score** | **61.98** | **61.84** |
- | **Recovery** | **100.00** | **99.78** |
-
- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-FP8-dynamic |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | IFEval (Inst Level Strict Acc, 0-shot)| 67.99 | 66.79 |
- | BBH (Acc-Norm, 3-shot) | 44.11 | 44.24 |
- | Math-Hard (Exact-Match, 4-shot) | 8.66 | 7.89 |
- | GPQA (Acc-Norm, 0-shot) | 28.30 | 26.90 |
- | MUSR (Acc-Norm, 0-shot) | 35.12 | 35.12 |
- | MMLU-Pro (Acc, 5-shot) | 26.87 | 28.33 |
- | **Average Score** | **35.17** | **34.88** |
- | **Recovery** | **100.00** | **99.16** |
-
- #### HumanEval pass@1 scores
- | Metric | ibm-granite/granite-3.1-2b-instruct | neuralmagic-ent/granite-3.1-2b-instruct-FP8-dynamic |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | HumanEval Pass@1 | 53.40 | 54.90 |
+ <table>
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Metric</th>
+ <th>ibm-granite/granite-3.1-2b-instruct</th>
+ <th>neuralmagic-ent/granite-3.1-2b-instruct-FP8-dynamic</th>
+ <th>Recovery (%)</th>
+ </tr>
+ </thead>
+ <tbody>
+ <!-- OpenLLM Leaderboard V1 -->
+ <tr>
+ <td rowspan="7"><b>OpenLLM Leaderboard V1</b></td>
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+ <td>55.63</td>
+ <td>55.03</td>
+ <td>98.92</td>
+ </tr>
+ <tr>
+ <td>GSM8K (Strict-Match, 5-shot)</td>
+ <td>60.96</td>
+ <td>61.49</td>
+ <td>100.87</td>
+ </tr>
+ <tr>
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
+ <td>75.21</td>
+ <td>75.26</td>
+ <td>100.07</td>
+ </tr>
+ <tr>
+ <td>MMLU (Acc, 5-shot)</td>
+ <td>54.38</td>
+ <td>54.24</td>
+ <td>99.74</td>
+ </tr>
+ <tr>
+ <td>TruthfulQA (MC2, 0-shot)</td>
+ <td>55.93</td>
+ <td>55.42</td>
+ <td>99.09</td>
+ </tr>
+ <tr>
+ <td>Winogrande (Acc, 5-shot)</td>
+ <td>69.67</td>
+ <td>69.61</td>
+ <td>99.91</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>61.98</b></td>
+ <td><b>61.84</b></td>
+ <td><b>99.78</b></td>
+ </tr>
+ <!-- OpenLLM Leaderboard V2 -->
+ <tr>
+ <td rowspan="7"><b>OpenLLM Leaderboard V2</b></td>
+ <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
+ <td>67.99</td>
+ <td>66.79</td>
+ <td>98.24</td>
+ </tr>
+ <tr>
+ <td>BBH (Acc-Norm, 3-shot)</td>
+ <td>44.11</td>
+ <td>44.24</td>
+ <td>100.29</td>
+ </tr>
+ <tr>
+ <td>Math-Hard (Exact-Match, 4-shot)</td>
+ <td>8.66</td>
+ <td>7.89</td>
+ <td>91.12</td>
+ </tr>
+ <tr>
+ <td>GPQA (Acc-Norm, 0-shot)</td>
+ <td>28.30</td>
+ <td>26.90</td>
+ <td>95.06</td>
+ </tr>
+ <tr>
+ <td>MUSR (Acc-Norm, 0-shot)</td>
+ <td>35.12</td>
+ <td>35.12</td>
+ <td>100.00</td>
+ </tr>
+ <tr>
+ <td>MMLU-Pro (Acc, 5-shot)</td>
+ <td>26.87</td>
+ <td>28.33</td>
+ <td>105.42</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>35.17</b></td>
+ <td><b>34.88</b></td>
+ <td><b>99.16</b></td>
+ </tr>
+ <!-- HumanEval -->
+ <tr>
+ <td rowspan="2"><b>HumanEval</b></td>
+ <td>HumanEval Pass@1</td>
+ <td>53.40</td>
+ <td>54.90</td>
+ <td><b>102.81</b></td>
+ </tr>
+ </tbody>
+ </table>
+


  ## Inference Performance
@@ -188,6 +290,15 @@ evalplus.evaluate \
  This model achieves up to 1.2x speedup in single-stream deployment on L40 GPUs.
  The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1, and [GuideLLM](https://github.com/neuralmagic/guidellm).

+ <details>
+ <summary>Benchmarking Command</summary>
+
+ ```
+ guidellm --model neuralmagic/granite-3.1-2b-instruct-FP8-dynamic --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max seconds 360 --backend aiohttp_server
+ ```
+
+ </details>
+
  ### Single-stream performance (measured with vLLM version 0.6.6.post1)
  <table>
  <tr>
 
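The benchmarking command in the last hunk targets an OpenAI-compatible vLLM endpoint at `http://localhost:8000/v1`, the serving path the README also references. The sketch below is an illustration of that path, not part of the commit: the `vllm serve` invocation, the placeholder API key, and the prompt are assumptions, while the sampling values mirror the README's deployment snippet.

```python
# Minimal sketch, assuming a local vLLM OpenAI-compatible server is running.
# Start it first, for example:
#   vllm serve neuralmagic/granite-3.1-2b-instruct-FP8-dynamic --max-model-len 4096
from openai import OpenAI

# Same base URL as the GuideLLM --target above; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/granite-3.1-2b-instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.3,   # mirrors SamplingParams in the README's deployment example
    max_tokens=256,
)
print(response.choices[0].message.content)
```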