---
license: mit
base_model:
- deepseek-ai/DeepSeek-V3
---

# Model Overview

- **Model Architecture:** DeepSeek V3
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **Operating Systems:** Linux
- **Inference Engine:** vLLM
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Quantization:**
  - **Weight:**
    - Type: OCP MXFP4
    - Mode: Static
  - **Activation:**
    - Type: OCP MXFP4
    - Mode: Dynamic
  - **KV Cache:**
    - Type: OCP FP8
    - Mode: Static
- **Calibration Dataset:** Pile

This model was built with DeepSeek-V3 by applying AMD-Quark for MXFP4 quantization.

# Model Quantization

The model was quantized from unsloth/DeepSeek-V3-0324-BF16 using AMD-Quark. Weights and activations were quantized to MXFP4, and the KV cache was quantized to FP8. The AutoSmoothQuant algorithm was applied to improve accuracy during quantization.

**Quantization Scripts**

```console
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py \
    --model_dir "/deepseek-ai/DeepSeek-V3-0324-BF16/" \
    --quant_scheme "w_mxfp4_a_mxfp4" \
    --quant_algo_config_file "llm_ptq/models/deepseekv2v3/autosmoothquant_config.json" \
    --num_calib_data 128 \
    --exclude_layers "$exclude_layers" \
    --multi_gpu true \
    --quant_algo "autosmoothquant" \
    --model_export "hf_format" \
    --output_dir "$output_dir"
```

# Deployment

- Backend: vLLM
- Description: This model can be deployed efficiently using the vLLM backend.

# Evaluation

- Tasks:
  - Wikitext
  - GSM8K
- Framework: lm-evaluation-harness
- Engine: vLLM

# Accuracy

- Wikitext perplexity:
3.33074593544006

**Reproduction Command**

```console
# Wikitext:
lm_eval \
    --model vllm \
    --model_args pretrained="amd/DeepSeek-V3-0324-WMXFP4-AMXFP4-MoE-Quant-ASQ",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks wikitext \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 5 \
    --batch_size auto

# GSM8K:
lm_eval \
    --model vllm \
    --model_args pretrained="amd/DeepSeek-V3-0324-WMXFP4-AMXFP4-MoE-Quant-ASQ",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks gsm8k_llama \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 8 \
    --batch_size auto
```

# License

Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
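The Deployment section above notes that the model runs on the vLLM backend. As a minimal sketch of how that might look, the commands below launch vLLM's OpenAI-compatible server and send it one request; the flags mirror the `tensor_parallel_size`, `kv_cache_dtype`, and `gpu_memory_utilization` values used in the evaluation commands and are assumptions, not tuned recommendations:

```shell
# Serve the quantized model via vLLM's OpenAI-compatible API
# (sketch; assumes a multi-GPU host matching the evaluation setup above)
vllm serve amd/DeepSeek-V3-0324-WMXFP4-AMXFP4-MoE-Quant-ASQ \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85

# Once the server is up (default port 8000), query it:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "amd/DeepSeek-V3-0324-WMXFP4-AMXFP4-MoE-Quant-ASQ",
         "messages": [{"role": "user", "content": "Hello"}]}'
```

Any OpenAI-compatible client can then target `http://localhost:8000/v1` in place of the `curl` call.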