---
language:
- en
- fr
- es
- de
- it
- pt
- nl
license: mit
library_name: transformers
tags:
- audio
- automatic-speech-recognition
- transformers.js
widget:
- example_title: LibriSpeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: LibriSpeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
---

# Whisper-Large-V3-Distil-Multi7-v0.2

A multilingual distilled Whisper model with 2 decoder layers, supporting 7 European languages: English, French, Spanish, German, Italian, Portuguese, and Dutch.

The model was trained during my work on [Distil-Large-v3.5](https://huggingface.co/distil-whisper/distil-large-v3.5).

A notable feature is its native support for **code-switching**. The model can switch languages within a single segment: when it detects a language change, it emits a new language token and continues transcribing in that language, as demonstrated in the example in the Usage section.

*During training, the `<|yue|>` language token was repurposed as an automatic language-detection token, which is what enables code-switching at inference time. To use this feature, set the language parameter to `cantonese` (the default).*
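
Because `cantonese` is already the default, the explicit setting is only needed when a script overrides the language. The following is a minimal sketch of that setting, assuming `model` and `processor` have been loaded as in the Usage section. The helper name `transcribe_auto_language` is hypothetical and not part of this repository; `language` and `task` are the usual keyword arguments accepted by Whisper's `generate` in transformers.

```python
def transcribe_auto_language(model, processor, audio_array, sampling_rate, max_new_tokens=128):
    """Transcribe one audio clip with automatic language detection.

    Hypothetical helper: `model` and `processor` are assumed to be the
    objects loaded in the Usage section of this card.
    """
    # Convert the raw waveform into log-mel input features.
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    input_features = inputs.input_features.to(model.device, dtype=model.dtype)

    # "cantonese" maps to the repurposed <|yue|> token, which this model
    # treats as "detect the language automatically" (per this model card).
    predicted_ids = model.generate(
        input_features,
        language="cantonese",
        task="transcribe",
        max_new_tokens=max_new_tokens,
    )

    # Drop special tokens and return the plain transcription of the first item.
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

A call would look like `transcribe_auto_language(model, processor, sample["array"], sample["sampling_rate"])`, with `sample` taken from a dataset as in the Usage section.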

The model currently performs below both the monolingual distilled versions and Whisper-Large-v3-Turbo. Future work should investigate better training procedures, and possibly more training data, to compress multilingual capabilities into a single model more effectively.

## Table of Contents

- [Usage](#usage)
- [Evaluation](#evaluation)

## Usage

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Use the GPU with half precision when available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the processor and model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi7-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Example audio (a code-switching sample)
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]

# Ground truth text
print(text)
# Aber sei ihnen nicht böse, Habibi, vergib ihnen, sie vergaßen die Liebe, sie vergaßen die Bibel,
# wünsch ihnen den Frieden. Nous allons construire des radiotélescopes géants comme celui-ci,
# qui est mon préféré. Questa è un'immagine di Cairo Open City, una mostra che il museo Folkwang di
# Essen ha dedicato al ruolo della mobile photography nella primavera Araba.

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate tokens
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
# Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe.
# Wünsche ihnen dem Frieden. Nous allons construire des radiotelescopes géants, comme celui-ci qui
# est mon préféré. Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.

# Inspect the generated tokens, keeping special tokens such as language tags
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
# <|de|> Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe.
# Wünsche ihnen dem Frieden.<|fr|> Nous allons construire des radiotelescopes géants, comme celui-ci qui
# est mon préféré.<|es|> Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.
```
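
The language tokens in the last output can be used to split a code-switched transcription into per-language segments. The sketch below is one possible way to do that with the standard library only. The function name `split_by_language` and both regular expressions are illustrative assumptions, not part of this repository; the pattern assumes language tags are two or three lowercase letters, as in `<|de|>` or `<|yue|>`.

```python
import re

# Language tags such as <|de|>, <|fr|>, <|yue|> (assumed: 2-3 lowercase letters).
LANG_TAG = re.compile(r"<\|([a-z]{2,3})\|>")
# Any other special token, e.g. <|startoftranscript|> or <|endoftext|>.
OTHER_SPECIAL = re.compile(r"<\|[^|>]+\|>")


def split_by_language(tagged_text):
    """Split a transcription decoded with skip_special_tokens=False
    into a list of (language_code, text) segments."""
    segments = []
    current_lang = None  # stays None for text that precedes any language tag
    last_end = 0

    def flush(raw):
        # Remove leftover special tokens and keep only non-empty text.
        cleaned = OTHER_SPECIAL.sub("", raw).strip()
        if cleaned:
            segments.append((current_lang, cleaned))

    for match in LANG_TAG.finditer(tagged_text):
        flush(tagged_text[last_end:match.start()])
        current_lang = match.group(1)
        last_end = match.end()
    flush(tagged_text[last_end:])
    return segments


example = (
    "<|de|> Wünsche ihnen dem Frieden."
    "<|fr|> Nous allons construire des radiotelescopes géants."
    "<|es|> Esta es una imagen de Cairo Open City."
)
for lang, text in split_by_language(example):
    print(lang, "->", text)
```

Language tags that are not followed by any text produce no segment, so a leading detection token directly followed by another language tag is dropped.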

## Evaluation

### English

| Model | LIUM_tedlium | mcv17 | voxpopuli | fleurs | kensho_spgispeech | librispeech-test_clean | librispeech-test_other | speechcolab_gigaspeech |
| ------------------------------------------ | ------------ | ----- | --------- | ------ | ----------------- | ---------------------- | ---------------------- | ---------------------- |
| openai/whisper-large-v3 | 10.58 | 10.13 | 8.93 | 5.72 | 2.95 | 1.87 | 3.58 | 10.07 |
| openai/whisper-large-v3-turbo | 10.20 | 11.74 | 11.78 | 6.13 | 2.95 | 1.98 | 3.94 | 10.11 |
| distil-whisper/distil-large-v3 | 8.93 | 12.41 | 7.72 | 7.59 | 3.25 | 2.42 | 5.11 | 10.08 |
| distil-whisper/distil-large-v3.5 | 8.65 | 11.07 | 7.54 | 6.74 | 2.86 | 2.28 | 4.94 | 9.84 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 8.88 | 11.33 | 7.60 | 6.97 | 3.03 | 2.51 | 5.24 | 10.12 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 9.36 | 11.32 | 7.65 | 7.02 | 2.99 | 2.46 | 5.24 | 10.06 |

### French

| Model | mcv17 | mls | voxpopuli | mtedx | af_accented | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ------------------------------------------- | ----- | ---- | --------- | ----- | ----------- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3 | 10.98 | 4.69 | 11.15 | 8.67 | 7.51 | 5.4 | 9.87 | 8.97 | 9 | 8.01 |
| openai/whisper-large-v3-turbo | 12.41 | 5.1 | 12.21 | 9.87 | 8.37 | 5.48 | 10.12 | 9 | 8.49 | 8.39 |
| bofenghuang/whisper_large_v3_distil_fr_v0.2 | 11.1 | 5 | 10.68 | 8.75 | 7.09 | 6.35 | 9.44 | 9.84 | 8.94 | 8.93 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 11.96 | 6.04 | 11.07 | 9.16 | 7.99 | 7.10 | 10.42 | 12.61 | 9.06 | 11.75 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 12.19 | 6.2 | 11.29 | 9.13 | 8.26 | 7.17 | 10.04 | 12.26 | 8.93 | 11.56 |

### Spanish

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ------------------------------------------ | ----- | ---- | --------- | ----- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3 | 4.91 | 3.97 | 11.06 | 6.52 | 4.22 | 10.85 | 10.36 | 5.90 | 5.22 |
| openai/whisper-large-v3-turbo | 5.74 | 4.41 | 16.02 | 6.66 | 4.59 | 11.55 | 10.68 | 6.46 | 5.41 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 5.58 | 4.34 | 8.52 | 7.43 | 5.20 | 11.26 | 13.43 | 5.69 | 8.95 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 5.70 | 4.35 | 8.55 | 7.56 | 5.15 | 11.45 | 13.54 | 5.84 | 8.27 |

### German

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ------------------------------------------ | ----- | ---- | --------- | ----- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3 | 6.11 | 5.60 | 17.75 | 19.63 | 5.92 | 11.21 | 10.35 | 17.64 | 17.76 |
| openai/whisper-large-v3-turbo | 7.45 | 6.43 | 20.48 | 20.00 | 6.45 | 10.57 | 9.70 | 18.04 | 18.37 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 7.31 | 6.45 | 12.41 | 21.48 | 8.20 | 11.04 | 13.55 | 19.54 | 21.76 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 7.57 | 6.67 | 12.42 | 21.95 | 8.28 | 11.21 | 13.84 | 19.90 | 21.67 |

### Italian

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ------------------------------------------- | ----- | ----- | --------- | ----- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3 | 5.71 | 9.58 | 28.45 | 7.21 | 4.28 | 6.95 | 6.37 | 6.83 | 7.28 |
| openai/whisper-large-v3-turbo | 6.77 | 10.64 | 30.69 | 7.41 | 4.69 | 6.88 | 6.52 | 7.98 | 7.37 |
| bofenghuang/whisper_large_v3_distil_it_v0.2 | 6.15 | 9.22 | 17.27 | 7.52 | 5.26 | 6.06 | 6.99 | 7.84 | 8.42 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.78 | 11.42 | 17.53 | 8.07 | 5.68 | 7.04 | 9.51 | 7.51 | 10.47 |

### Portuguese

| Model | mcv17 | mls | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ------------------------------------------ | ----- | ---- | ----- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3 | 6.76 | 7.04 | 8.91 | 5.86 | 12.11 | 12.39 | 8.70 | 8.98 |
| openai/whisper-large-v3-turbo | 7.66 | 6.64 | 8.84 | 6.11 | 12.42 | 11.62 | 10.97 | 9.04 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 8.31 | 6.75 | 10.11 | 7.10 | 12.74 | 14.97 | 9.64 | 11.78 |

### Dutch

| Model | mcv17 | mls | voxpopuli | fleurs |
| ------------------------------------------ | ----- | ----- | --------- | ------ |
| openai/whisper-large-v3 | 4.51 | 66.95 | 23.35 | 6.99 |
| openai/whisper-large-v3-turbo | 6.16 | 52.37 | 27.42 | 7.59 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.76 | 14.82 | 14.92 | 10.86 |