---
language:
- en
- fr
- es
- pt
- hi
- de
- nl
- it
base_model:
- mistralai/Voxtral-Mini-3B-2507
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- fp8
- quantized
- multimodal
- conversational
- text-generation-inference
- automatic-speech-recognition
- automatic-speech-translation
- audio-text-to-text
- video-text-to-text
- compressed-tensors
license: apache-2.0
license_name: apache-2.0
name: RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic
description: A quantized version of the Voxtral-Mini-3B-2507 model, optimized for speech transcription, translation, and audio understanding with FP8 data type quantization.
readme: https://huggingface.co/RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic/main/README.md
tasks:
- automatic-speech-recognition
- automatic-speech-translation
- audio-to-text
- text-to-text
provider: RedHatAI
license_link: https://www.apache.org/licenses/LICENSE-2.0
---

# Voxtral-Mini-3B-2507-FP8-dynamic

## Model Overview
- **Model Architecture:** VoxtralForConditionalGeneration
  - **Input:** Audio-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
  - **Dedicated transcription mode:** Voxtral can operate in a pure speech-transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.
  - **Long-form context:** With a 32k-token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding.
  - **Built-in Q&A and summarization:** Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models.
  - **Natively multilingual:** Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
  - **Function-calling straight from voice:** Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents (see the sketch after the deployment examples below).
  - **Highly capable at text:** Retains the text-understanding capabilities of its language-model backbone, Ministral-3B.
- **Release Date:** 08/21/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme (a conceptual sketch of this scheme follows the creation recipe below). The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

## Deployment

### Use with vLLM

1. Initialize vLLM server:
```
vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic --tokenizer_mode mistral --config_format mistral --load_format mistral
```
2. Send requests to the server according to the use case, as in the following examples.
#### Audio Instruct

```python
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The speaker who is more inspiring is the one who delivered the farewell address, as they express
# gratitude, optimism, and a strong commitment to the nation and its citizens. They emphasize the importance of
# self-government and active citizenship, encouraging everyone to participate in the democratic process. In contrast,
# the other speaker provides a factual update on the weather in Barcelona, which is less inspiring as it
# lacks the emotional and motivational content of the farewell address.

# **Differences:**
# - The farewell address speaker focuses on the values and responsibilities of citizenship, encouraging active participation in democracy.
# - The weather update speaker provides factual information about the temperature in Barcelona, without any emotional or motivational content.

messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
```
#### Transcription

```python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)
audio = RawAudio.from_audio(audio)

req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
response = client.audio.transcriptions.create(**req)
print(response)
```
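#### Function calling from voice (illustrative sketch)

The overview above also lists function-calling straight from voice. The snippet below is a minimal, hedged sketch of how that could be exercised through vLLM's OpenAI-compatible tools interface: the `get_weather` tool and its schema are purely illustrative, `<your-server-host>` is a placeholder, and depending on your vLLM version the server may need to be started with tool parsing enabled (for example, `--tool-call-parser mistral --enable-auto-tool-choice`).

```python
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
model = client.models.list().data[0].id

# Illustrative tool definition; `get_weather` is a hypothetical backend function.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Ask the model to act on the spoken content by calling the tool.
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
audio_chunk = AudioChunk.from_audio(Audio.from_file(bcn_file, strict=False))
text_chunk = TextChunk(text="Call the weather tool for the city mentioned in this audio.")
user_msg = UserMessage(content=[audio_chunk, text_chunk]).to_openai()

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    tools=tools,
    temperature=0.2,
    top_p=0.95,
)
print(response.choices[0].message.tool_calls)
```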
## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as shown below.
#### Creation details

```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
MODEL_ID = "mistralai/Voxtral-Mini-3B-2507"
model = VoxtralForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Recipe
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["language_model.lm_head", "re:audio_tower.*", "re:multi_modal_projector.*"],
)

# Apply algorithms.
oneshot(
    model=model,
    recipe=recipe,
    processor=processor,
)

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```

After quantization, the model can be converted back into the mistralai format using the `convert_voxtral_hf_to_mistral.py` script included with the model.
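The `FP8_DYNAMIC` scheme applied above corresponds to the static per-channel weight scales and dynamic per-token activation scales described in the Model Optimizations section. The following is a minimal, conceptual PyTorch sketch of that scheme for a single linear layer; it is not the llm-compressor or vLLM implementation, and the tensor shapes are arbitrary examples.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_weight_per_channel(w: torch.Tensor):
    # Static, symmetric scale per output channel, chosen once at quantization time.
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX
    w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic, symmetric scale per token, recomputed on every forward pass.
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Dequantize-and-matmul approximates the original bf16 linear layer.
w = torch.randn(64, 128, dtype=torch.bfloat16)  # [out_features, in_features]
x = torch.randn(4, 128, dtype=torch.bfloat16)   # [tokens, in_features]
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)
y_fp8 = (x_fp8.to(torch.bfloat16) * x_scale) @ (w_fp8.to(torch.bfloat16) * w_scale).T
print(torch.mean((y_fp8 - x @ w.T) ** 2))  # small quantization error
```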
## Evaluation

The model was evaluated on the Fleurs transcription task. Recovery is computed with respect to the complement of the word error rate, i.e., (1 − WER); a worked example follows the table below.
| Benchmark | Language | Voxtral-Mini-3B-2507 | Voxtral-Mini-3B-2507-FP8-dynamic<br>(this model) | Recovery |
|-----------|----------|----------------------|--------------------------------------------------|----------|
| Fleurs (WER) | English | 3.89% | 3.95% | 99.9% |
| | French | 5.07% | 4.86% | 100.2% |
| | Spanish | 3.63% | 3.55% | 100.1% |
| | German | 5.00% | 5.01% | 100.0% |
| | Italian | 2.54% | 2.57% | 100.0% |
| | Portuguese | 3.85% | 4.03% | 99.8% |
| | Dutch | 7.01% | 7.20% | 99.8% |
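As a quick check on the numbers above, recovery is the ratio of the WER complements; for example, for the English row:

```python
# Recovery relative to the complement of WER, using the English Fleurs row above.
wer_baseline = 0.0389   # Voxtral-Mini-3B-2507
wer_quantized = 0.0395  # Voxtral-Mini-3B-2507-FP8-dynamic (this model)
recovery = (1 - wer_quantized) / (1 - wer_baseline)
print(f"{recovery:.1%}")  # 99.9%
```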