---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama4
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
---

# Llama-4-Scout-17B-16E-Instruct-NVFP4

## Model Overview
- **Model Architecture:** Llama4ForConditionalGeneration
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than those explicitly supported.
- **Release Date:** 7/15/25
- **Version:** 1.0
- **License(s):** [llama4](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/blob/main/LICENSE)
- **Model Developers:** RedHatAI

This model is a quantized version of [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) to the FP4 data type, ready for inference with vLLM>=0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
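
As a rough check of that figure (a sketch only; the ~109B total parameter count for Llama-4-Scout and the omission of scale overhead are assumptions):

```python
# Back-of-the-envelope weight-storage estimate (sketch; assumes ~109B total
# parameters for Llama-4-Scout and ignores quantization scale overhead).
total_params = 109e9

bf16_gb = total_params * 16 / 8 / 1e9  # 16 bits/param -> ~218 GB
fp4_gb = total_params * 4 / 8 / 1e9    # 4 bits/param  -> ~55 GB

print(f"BF16 ~{bf16_gb:.0f} GB, FP4 ~{fp4_gb:.0f} GB, "
      f"reduction ~{(1 - fp4_gb / bf16_gb) * 100:.0f}%")
```

In practice the realized saving is somewhat smaller, since unquantized modules (attention, router, vision tower) and the block scales add overhead.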

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
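
For example (a minimal sketch; the default port and the `--tensor-parallel-size 2` setting mirror the offline example above, and the prompt is illustrative), start a server with `vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4 --tensor-parallel-size 2` and query it with an OpenAI client:

```python
# Sketch: query a vLLM OpenAI-compatible server started with
#   vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4 --tensor-parallel-size 2
# The base_url assumes vLLM's default host/port (localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```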

## Creation

This model was created by applying [LLM Compressor with calibration samples from the neuralmagic/calibration dataset](https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/llama4_example.py), as presented in the code snippet below.

```python
import torch
from datasets import load_dataset
from transformers import Llama4ForConditionalGeneration, Llama4Processor

from llmcompressor import oneshot
from llmcompressor.modeling import prepare_for_calibration
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = Llama4Processor.from_pretrained(model_id)

# We update `Llama4TextMoe` modules with custom `SequentialLlama4TextMoe`.
# This change allows compatibility with vLLM.
# To apply your own custom module for experimentation, consider updating
# `SequentialLlama4TextMoe` under llmcompressor/modeling/llama4.py
model = prepare_for_calibration(model)

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 8192

ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")


def preprocess_function(example):
    # Wrap each message's text in the structured content format the processor expects.
    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )

    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    assert len(batch) == 1
    return {
        key: torch.tensor(value)
        if key != "pixel_values"
        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        for key, value in batch[0].items()
    }


# Configure the quantization algorithm to run.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:vision_model.*",
        "re:multi_modal_projector.*",
        "Llama4TextAttention",
    ],
)

# Apply algorithms.
# Due to the large size of Llama4, we specify sequential targets such that
# only one MLP is loaded into GPU memory at a time.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
    sequential_targets=["Llama4TextMLP"],
)

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```
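
As a quick check that the export worked (a sketch; assumes the script above has completed, and the directory name matches its `SAVE_DIR`), the compressed checkpoint's `config.json` records the quantization scheme:

```python
# Sketch: verify the compressed checkpoint records its quantization scheme.
# Assumes the creation script above has run to completion.
import json
import os

SAVE_DIR = "Llama-4-Scout-17B-16E-Instruct-NVFP4"

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)

# save_compressed=True writes the scheme under "quantization_config".
print(json.dumps(config.get("quantization_config", {}), indent=2))
```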

## Evaluation

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval_64 benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).
<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>Llama-4-Scout-17B-16E-Instruct (A100)</th>
      <th>Llama-4-Scout-17B-16E-Instruct-NVFP4 (B200)</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="8"><b>OpenLLM V1</b></td>
      <td>ARC Challenge (LLaMA)</td>
      <td>93.39</td>
      <td>92.10</td>
      <td>98.62%</td>
    </tr>
    <tr>
      <td>GSM8K (LLaMA)</td>
      <td>92.87</td>
      <td>94.31</td>
      <td>101.55%</td>
    </tr>
    <tr>
      <td>MMLU (LLaMA)</td>
      <td>81.01</td>
      <td>79.37</td>
      <td>97.98%</td>
    </tr>
    <tr>
      <td>MMLU-CoT (LLaMA)</td>
      <td>85.99</td>
      <td>84.58</td>
      <td>98.36%</td>
    </tr>
    <tr>
      <td>Hellaswag</td>
      <td>79.13</td>
      <td>78.47</td>
      <td>99.17%</td>
    </tr>
    <tr>
      <td>TruthfulQA-mc2</td>
      <td>62.53</td>
      <td>60.83</td>
      <td>97.28%</td>
    </tr>
    <tr>
      <td>Winogrande</td>
      <td>73.56</td>
      <td>73.01</td>
      <td>99.25%</td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td><b>81.21</b></td>
      <td><b>80.38</b></td>
      <td><b>98.89%</b></td>
    </tr>
    <tr>
      <td rowspan="7"><b>OpenLLM V2</b></td>
      <td>MMLU-Pro</td>
      <td>55.64</td>
      <td>53.84</td>
      <td>96.76%</td>
    </tr>
    <tr>
      <td>IFEval</td>
      <td>89.09</td>
      <td>89.93</td>
      <td>100.94%</td>
    </tr>
    <tr>
      <td>BBH</td>
      <td>65.14</td>
      <td>64.00</td>
      <td>98.25%</td>
    </tr>
    <tr>
      <td>Math-Hard</td>
      <td>52.64</td>
      <td>56.12</td>
      <td>106.61%</td>
    </tr>
    <tr>
      <td>GPQA</td>
      <td>32.21</td>
      <td>31.88</td>
      <td>98.98%</td>
    </tr>
    <tr>
      <td>MuSR</td>
      <td>42.20</td>
      <td>42.99</td>
      <td>101.87%</td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td><b>56.15</b></td>
      <td><b>56.46</b></td>
      <td><b>100.55%</b></td>
    </tr>
    <tr>
      <td rowspan="6"><b>Coding</b></td>
      <td>HumanEval Instruct pass@1</td>
      <td>81.71</td>
      <td>76.22</td>
      <td>93.29%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@2</td>
      <td>83.49</td>
      <td>81.10</td>
      <td>97.14%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@8</td>
      <td>87.71</td>
      <td>88.66</td>
      <td>101.08%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@16</td>
      <td>88.71</td>
      <td>90.11</td>
      <td>101.58%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@32</td>
      <td>89.38</td>
      <td>90.91</td>
      <td>101.71%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@64</td>
      <td>89.63</td>
      <td>91.46</td>
      <td>102.04%</td>
    </tr>
  </tbody>
</table>
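
The Recovery (%) column is the quantized score expressed as a percentage of the baseline score; a minimal sketch of the computation, using the MMLU (LLaMA) row from the table above:

```python
# Sketch: how the Recovery (%) column is derived.
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return quantized_score / baseline_score * 100

# MMLU (LLaMA): 79.37 (NVFP4) vs. 81.01 (baseline) -> ~97.98%
print(f"{recovery(79.37, 81.01):.2f}%")
```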

### Reproduction

The results were obtained using the following commands:

#### MMLU_LLAMA
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks mmlu_llama \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### MMLU_COT_LLAMA
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks mmlu_cot_llama \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### ARC-Challenge
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks arc_challenge_llama \
  --apply_chat_template \
  --batch_size auto
```

#### GSM-8K
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks gsm8k_llama \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### Hellaswag
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks hellaswag \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### Winogrande
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks winogrande \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### TruthfulQA
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks truthfulqa \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### OpenLLM v2
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --batch_size auto
```

#### HumanEval and HumanEval_64
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_instruct \
  --batch_size auto


lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```