---
library_name: mistral-common
language:
- en
- fr
- de
- es
- it
- pt
- nl
- hi
license: apache-2.0
inference: false
tags:
- vllm
- FP8
- audio
- llmcompressor
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
base_model: mistralai/Voxtral-Mini-3B-2507
---

# Voxtral-Mini-3B-2507-FP8-dynamic

## Model Overview
- **Model Architecture:** VoxtralForConditionalGeneration
  - **Input:** Audio-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
  - **Dedicated transcription mode:** Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.
  - **Long-form context:** With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding.
  - **Built-in Q&A and summarization:** Supports asking questions directly through audio, and can analyze audio and generate structured summaries without the need for separate ASR and language models.
  - **Natively multilingual:** Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
  - **Function-calling straight from voice:** Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
  - **Highly capable at text:** Retains the text understanding capabilities of its language model backbone, Ministral-3B.
- **Release Date:** 08/21/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
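
The scheme can be pictured as follows. This is a minimal illustrative sketch of symmetric FP8 (E4M3, maximum magnitude 448) quantization with per-channel weight scales and per-token activation scales; it is not the llm-compressor implementation.

```python
import torch

FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def quantize_weight_per_channel(w: torch.Tensor):
    # Static, symmetric, per-channel: one scale per output channel (row),
    # computed once at compression time.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn), scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic, symmetric, per-token: one scale per token (row),
    # recomputed at every forward pass.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn), scale

w = torch.randn(1024, 1024, dtype=torch.bfloat16)  # weight of a linear operator
x = torch.randn(8, 1024, dtype=torch.bfloat16)     # a batch of 8 token activations
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)
# Dequantization approximates the original matmul:
# (x_fp8.to(torch.bfloat16) * x_scale) @ (w_fp8.to(torch.bfloat16) * w_scale).T
```

Because the activation scales are computed on the fly, the dynamic scheme needs no calibration data, which is why the creation recipe below contains no calibration dataset.
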
## Deployment

### Use with vLLM

1. Initialize vLLM server:
```bash
vllm serve RedHatAI/Voxtral-Mini-3B-2507-FP8-dynamic --tokenizer_mode mistral --config_format mistral --load_format mistral
```

2. Send requests to the server, according to the use case. See the following examples.

<details>
<summary>Audio Instruct</summary>

```python
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The speaker who is more inspiring is the one who delivered the farewell address, as they express
# gratitude, optimism, and a strong commitment to the nation and its citizens. They emphasize the importance of
# self-government and active citizenship, encouraging everyone to participate in the democratic process. In contrast,
# the other speaker provides a factual update on the weather in Barcelona, which is less inspiring as it
# lacks the emotional and motivational content of the farewell address.

# **Differences:**
# - The farewell address speaker focuses on the values and responsibilities of citizenship, encouraging active participation in democracy.
# - The weather update speaker provides factual information about the temperature in Barcelona, without any emotional or motivational content.


messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai(),
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
```
</details>

<details>
<summary>Transcription</summary>

```python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)
```
</details>

## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below.

<details>
<summary>Creation details</summary>

```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
MODEL_ID = "mistralai/Voxtral-Mini-3B-2507"

model = VoxtralForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Recipe: FP8 dynamic quantization of all Linear layers, skipping the LM head,
# the audio tower, and the multi-modal projector.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["language_model.lm_head", "re:audio_tower.*", "re:multi_modal_projector.*"],
)

# Apply algorithms.
oneshot(
    model=model,
    recipe=recipe,
)

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-DYNAMIC"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```

After quantization, the model can be converted back into the mistralai format using the `convert_voxtral_hf_to_mistral.py` script included with the model.
</details>

## Evaluation

The model was evaluated on the Fleurs transcription task.
Recovery is computed with respect to the complement of the word error rate (WER), i.e., as the ratio of (1 - WER) for this model to (1 - WER) for the baseline.

<table border="1" cellspacing="0" cellpadding="6">
  <tr>
    <th>Benchmark</th>
    <th>Language</th>
    <th>Voxtral-Mini-3B-2507</th>
    <th>Voxtral-Mini-3B-2507-FP8-dynamic<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="7"><strong>Fleurs<br>WER</strong></td>
    <td>English</td>
    <td>3.89%</td>
    <td>3.95%</td>
    <td>99.9%</td>
  </tr>
  <tr>
    <td>French</td>
    <td>5.07%</td>
    <td>4.86%</td>
    <td>100.2%</td>
  </tr>
  <tr>
    <td>Spanish</td>
    <td>3.63%</td>
    <td>3.55%</td>
    <td>100.1%</td>
  </tr>
  <tr>
    <td>German</td>
    <td>5.00%</td>
    <td>5.01%</td>
    <td>100.0%</td>
  </tr>
  <tr>
    <td>Italian</td>
    <td>2.54%</td>
    <td>2.57%</td>
    <td>100.0%</td>
  </tr>
  <tr>
    <td>Portuguese</td>
    <td>3.85%</td>
    <td>4.03%</td>
    <td>99.8%</td>
  </tr>
  <tr>
    <td>Dutch</td>
    <td>7.01%</td>
    <td>7.20%</td>
    <td>99.8%</td>
  </tr>
</table>
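
The recovery column can be reproduced from the WER values above; the small sanity-check script below uses the numbers from this table.

```python
# Recovery = (1 - WER_quantized) / (1 - WER_baseline), with WER in percent.
wer = {  # language: (Voxtral-Mini-3B-2507, this model)
    "English": (3.89, 3.95), "French": (5.07, 4.86), "Spanish": (3.63, 3.55),
    "German": (5.00, 5.01), "Italian": (2.54, 2.57),
    "Portuguese": (3.85, 4.03), "Dutch": (7.01, 7.20),
}
for lang, (base, quant) in wer.items():
    recovery = (1 - quant / 100) / (1 - base / 100)
    print(f"{lang}: {recovery:.1%}")  # e.g. English: 99.9%
```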