Quark Team FP8 Mixtral-8x7B Model Overview

Model Information For MLPerf

  • Model Name: Mixtral-8x7B
  • Version: MLPerf v5.1
  • Commit: Closed Division Commit
  • Supported Hardware Microarchitecture: AMD MI300/MI325
  • ROCm: 6.4.1
  • Operating System(s): Linux
  • Transformers: 4.46.3
  • Quark: 0.9

Calibration Dataset

This model was built from the mistralai Mixtral-8x7B-Instruct-v0.1 model by applying AMD Quark for FP8 quantization. The calibration set consists of 1024 samples drawn from the mixed calibration data provided by mlcommons/inference, comprising:

  • 325 GSM8k samples
  • 325 MBXP samples
  • 374 Open Orca samples
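
For reference, a minimal sketch of inspecting the calibration pickle; it assumes the file deserializes to a pandas DataFrame with a dataset column identifying the source of each sample (column names and selection logic are illustrative, not guaranteed):

import pandas as pd

# Path as used in the quantization script below.
df = pd.read_pickle("./mlperf_data/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl")

# Hypothetical column name: count samples per source dataset.
print(df["dataset"].value_counts())

# Draw 1024 calibration samples (the script below selects via --num_calib_data).
calib = df.sample(n=1024, random_state=0)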

Quantized Tensors

The following tensors are quantized in each decoder:

  • Expert MLP Inputs and Weights (excluding the router)
  • Linear qkv Inputs and Weights
  • KV Cache Entries
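
These tensors use the FP8 E4M3 format. As a rough illustration (not Quark's internal implementation), per-tensor FP8 quantization scales a tensor so its largest magnitude maps to the E4M3 maximum of 448, then casts:

import torch

x = torch.randn(16, 16) * 3.0

# Per-tensor scale: map the largest magnitude to E4M3's max value (448).
scale = x.abs().amax() / 448.0

# Quantize to FP8 E4M3, then dequantize for comparison.
q = (x / scale).to(torch.float8_e4m3fn)
x_dq = q.to(torch.float32) * scale

print((x - x_dq).abs().max())  # small quantization error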

Ignored Layers

The following layers are ignored during quantization:

  • *.gate
  • *.o_proj
  • lm_head
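
These patterns are shell-style wildcards matched against module names, as passed to the --exclude_layers flag in the script below. A quick way to check which Mixtral modules a pattern such as "*.gate" would catch (illustrative, not Quark's matching code):

from fnmatch import fnmatch

modules = [
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.0.w1",
    "model.layers.0.self_attn.o_proj",
    "lm_head",
]
patterns = ["lm_head", "*.gate", "*.o_proj"]

for name in modules:
    excluded = any(fnmatch(name, p) for p in patterns)
    print(f"{name}: {'excluded' if excluded else 'quantized'}")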

Algorithms

The AutoSmoothQuant algorithm is applied during weight-activation quantization to improve quantized accuracy.
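
AutoSmoothQuant automatically searches per-layer smoothing scales; the underlying SmoothQuant idea is to migrate activation outliers into the weights through a mathematically equivalent rescaling. A sketch with a fixed smoothing strength alpha (the automated per-layer search is what the "Auto" variant adds):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
X[:, 3] *= 50.0                      # an outlier activation channel
W = rng.standard_normal((8, 16))

# SmoothQuant scale per input channel: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

# Equivalent factorization: (X / s) @ (s * W) == X @ W,
# but X / s has a much flatter range and is easier to quantize.
Y = X @ W
Y_smoothed = (X / s) @ (W * s[:, None])
assert np.allclose(Y, Y_smoothed)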

Quantization Scripts

cd examples/torch/language_modeling/llm_ptq/
MODEL_DIR="mistralai/Mixtral-8x7B-Instruct-v0.1"
DATASET="./mlperf_data/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl"
OUTPUT_DIR="amd/Mixtral-8x7B-Instruct-v0.1_FP8_MLPerf_V3"

python3 quantize_quark.py --model_dir "${MODEL_DIR}" \
                          --output_dir "${OUTPUT_DIR}" \
                          --dataset "${DATASET}" \
                          --data_type float16 \
                          --multi_gpu \
                          --quant_scheme w_fp8_a_fp8 \
                          --kv_cache_dtype fp8 \
                          --num_calib_data 1024 \
                          --seq_len 1024 \
                          --min_kv_scale 1.0 \
                          --model_export hf_format \
                          --custom_mode fp8 \
                          --quant_algo autosmoothquant \
                          --exclude_layers "lm_head" "*.gate" "*.o_proj"
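
The hf_format export produces a Hugging Face-style checkpoint. A hedged sketch of serving it, assuming a vLLM build with FP8 support on ROCm (flags and defaults may differ across versions):

from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Mixtral-8x7B-Instruct-v0.1_FP8_MLPerf_V3",
    kv_cache_dtype="fp8",   # match the FP8 KV cache used at calibration
    tensor_parallel_size=1,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Question: What is 17 * 24? Answer:"], params)
print(outputs[0].outputs[0].text)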

Model Performance Comparison

Percentages in parentheses give the quantized score as a fraction of the baseline.

Metric               Baseline Accuracy    FP8 Quant Accuracy (% of baseline)
GSM8K (Math)         73.66                73.18 (99.34%)
Open Orca (Chat)
  - Rouge1           45.5989              45.4362 (99.64%)
  - Rouge2           23.3526              23.168 (99.21%)
  - RougeL           30.4608              30.2922 (99.45%)
MBXP (Code)          60.16                60.08 (99.87%)

License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
