---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama4
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
---

# Llama-4-Scout-17B-16E-Instruct-NVFP4

## Model Overview
- **Model Architecture:** Llama4ForConditionalGeneration
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than those explicitly supported.
- **Release Date:** 7/15/25
- **Version:** 1.0
- **License(s):** [llama4](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct/blob/main/LICENSE)
- **Model Developers:** RedHatAI

This model is a quantized version of [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) to the FP4 data type, ready for inference with vLLM>=0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
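
As a rough check of that figure (a sketch only; the ~109B total parameter count for Llama-4-Scout and the omission of scale overhead are assumptions):

```python
# Back-of-the-envelope weight-storage estimate (sketch; assumes ~109B total
# parameters for Llama-4-Scout and ignores quantization scale overhead).
total_params = 109e9

bf16_gb = total_params * 16 / 8 / 1e9  # 16 bits/param -> ~218 GB
fp4_gb = total_params * 4 / 8 / 1e9    # 4 bits/param  -> ~55 GB

print(f"BF16 ~{bf16_gb:.0f} GB, FP4 ~{fp4_gb:.0f} GB, "
      f"reduction ~{(1 - fp4_gb / bf16_gb) * 100:.0f}%")
```

In practice the realized saving is somewhat smaller, since unquantized modules (attention, router, vision tower) and the block scales add overhead.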

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
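
For example (a minimal sketch; the default port and the `--tensor-parallel-size 2` setting mirror the offline example above, and the prompt is illustrative), start a server with `vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4 --tensor-parallel-size 2` and query it with an OpenAI client:

```python
# Sketch: query a vLLM OpenAI-compatible server started with
#   vllm serve RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4 --tensor-parallel-size 2
# The base_url assumes vLLM's default host/port (localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```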

## Creation

This model was created by applying [LLM Compressor with calibration samples from the neuralmagic/calibration dataset](https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/llama4_example.py), as presented in the code snippet below.

```python
import torch
from datasets import load_dataset
from transformers import Llama4ForConditionalGeneration, Llama4Processor

from llmcompressor import oneshot
from llmcompressor.modeling import prepare_for_calibration
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = Llama4Processor.from_pretrained(model_id)

# We update `Llama4TextMoe` modules with custom `SequentialLlama4TextMoe`.
# This change allows compatibility with vLLM.
# To apply your own custom module for experimentation, consider updating
# `SequentialLlama4TextMoe` under llmcompressor/modeling/llama4.py
model = prepare_for_calibration(model)

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 8192

ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")


def preprocess_function(example):
    # Wrap each message's text in the structured content format the processor expects.
    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )

    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    assert len(batch) == 1
    return {
        key: torch.tensor(value)
        if key != "pixel_values"
        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        for key, value in batch[0].items()
    }


# Configure the quantization algorithm to run.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:vision_model.*",
        "re:multi_modal_projector.*",
        "Llama4TextAttention",
    ],
)

# Apply algorithms.
# Due to the large size of Llama4, we specify sequential targets such that
# only one MLP is loaded into GPU memory at a time.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
    sequential_targets=["Llama4TextMLP"],
)

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```
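
As a quick check that the export worked (a sketch; assumes the script above has completed, and the directory name matches its `SAVE_DIR`), the compressed checkpoint's `config.json` records the quantization scheme:

```python
# Sketch: verify the compressed checkpoint records its quantization scheme.
# Assumes the creation script above has run to completion.
import json
import os

SAVE_DIR = "Llama-4-Scout-17B-16E-Instruct-NVFP4"

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)

# save_compressed=True writes the scheme under "quantization_config".
print(json.dumps(config.get("quantization_config", {}), indent=2))
```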

## Evaluation

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval_64 benchmarks. All evaluations were conducted using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).
<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>Llama-4-Scout-17B-16E-Instruct (A100)</th>
      <th>Llama-4-Scout-17B-16E-Instruct-NVFP4 (B200)</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="8"><b>OpenLLM V1</b></td>
      <td>ARC Challenge (LLaMA)</td>
      <td>93.39</td>
      <td>92.10</td>
      <td>98.62%</td>
    </tr>
    <tr>
      <td>GSM8K (LLaMA)</td>
      <td>92.87</td>
      <td>94.31</td>
      <td>101.55%</td>
    </tr>
    <tr>
      <td>MMLU (LLaMA)</td>
      <td>81.01</td>
      <td>79.37</td>
      <td>97.98%</td>
    </tr>
    <tr>
      <td>MMLU-CoT (LLaMA)</td>
      <td>85.99</td>
      <td>84.58</td>
      <td>98.36%</td>
    </tr>
    <tr>
      <td>Hellaswag</td>
      <td>79.13</td>
      <td>78.47</td>
      <td>99.17%</td>
    </tr>
    <tr>
      <td>TruthfulQA-mc2</td>
      <td>62.53</td>
      <td>60.83</td>
      <td>97.28%</td>
    </tr>
    <tr>
      <td>Winogrande</td>
      <td>73.56</td>
      <td>73.01</td>
      <td>99.25%</td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td><b>81.21</b></td>
      <td><b>80.38</b></td>
      <td><b>98.89%</b></td>
    </tr>
    <tr>
      <td rowspan="7"><b>OpenLLM V2</b></td>
      <td>MMLU-Pro</td>
      <td>55.64</td>
      <td>53.84</td>
      <td>96.76%</td>
    </tr>
    <tr>
      <td>IFEval</td>
      <td>89.09</td>
      <td>89.93</td>
      <td>100.94%</td>
    </tr>
    <tr>
      <td>BBH</td>
      <td>65.14</td>
      <td>64.00</td>
      <td>98.25%</td>
    </tr>
    <tr>
      <td>Math-Hard</td>
      <td>52.64</td>
      <td>56.12</td>
      <td>106.61%</td>
    </tr>
    <tr>
      <td>GPQA</td>
      <td>32.21</td>
      <td>31.88</td>
      <td>98.98%</td>
    </tr>
    <tr>
      <td>MuSR</td>
      <td>42.20</td>
      <td>42.99</td>
      <td>101.87%</td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td><b>56.15</b></td>
      <td><b>56.46</b></td>
      <td><b>100.55%</b></td>
    </tr>
    <tr>
      <td rowspan="6"><b>Coding</b></td>
      <td>HumanEval Instruct pass@1</td>
      <td>81.71</td>
      <td>76.22</td>
      <td>93.29%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@2</td>
      <td>83.49</td>
      <td>81.10</td>
      <td>97.14%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@8</td>
      <td>87.71</td>
      <td>88.66</td>
      <td>101.08%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@16</td>
      <td>88.71</td>
      <td>90.11</td>
      <td>101.58%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@32</td>
      <td>89.38</td>
      <td>90.91</td>
      <td>101.71%</td>
    </tr>
    <tr>
      <td>HumanEval 64 Instruct pass@64</td>
      <td>89.63</td>
      <td>91.46</td>
      <td>102.04%</td>
    </tr>
  </tbody>
</table>
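
The Recovery (%) column is the quantized score expressed as a percentage of the baseline score; a minimal sketch of the computation, using the MMLU (LLaMA) row from the table above:

```python
# Sketch: how the Recovery (%) column is derived.
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return quantized_score / baseline_score * 100

# MMLU (LLaMA): 79.37 (NVFP4) vs. 81.01 (baseline) -> ~97.98%
print(f"{recovery(79.37, 81.01):.2f}%")
```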

### Reproduction

The results were obtained using the following commands:

#### MMLU_LLAMA
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks mmlu_llama \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### MMLU_COT_LLAMA
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks mmlu_cot_llama \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### ARC-Challenge
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks arc_challenge_llama \
  --apply_chat_template \
  --batch_size auto
```

#### GSM-8K
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks gsm8k_llama \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### Hellaswag
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks hellaswag \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### Winogrande
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks winogrande \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### TruthfulQA
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --tasks truthfulqa \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

#### OpenLLM v2
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --batch_size auto
```

#### HumanEval and HumanEval_64
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_instruct \
  --batch_size auto


lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```