---
language: en
model_name: Wav2Vec2-BART (Base) English ASR - VoxPopuli Best WER
license: mit
tags:
- automatic-speech-recognition
- speech-encoder-decoder
- wav2vec2
- bart
- english
- voxpopuli
- generated_from_trainer
- audio
- master-thesis
- pytorch
- transformers
datasets:
- facebook/voxpopuli
base_model:
- facebook/wav2vec2-base-en-voxpopuli-v2
- facebook/bart-base
model-index:
- name: matejhornik/wav2vec2-base_bart-base_voxpopuli-en
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli (English, Test)
      type: facebook/voxpopuli
      config: en
      split: test
    metrics:
    - name: WER
      type: wer
      value: 8.848048503220916
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli (English, Validation)
      type: facebook/voxpopuli
      config: en
      split: validation
    metrics:
    - name: WER
      type: wer
      value: 8.554638942253362
pipeline_tag: automatic-speech-recognition
library_name: transformers
metrics:
- wer
---

# Wav2Vec2-BART (Base) for English ASR on VoxPopuli - Best WER from Master's Thesis

This repository contains the checkpoint for a `SpeechEncoderDecoderModel` fine-tuned for Automatic Speech Recognition (ASR) on the English portion of the VoxPopuli dataset. This model achieved the **best Word Error Rate (WER) of 8.85% on the VoxPopuli English test set** within the experimental framework of the Master's thesis "Effective Training of Neural Networks for Automatic Speech Recognition" by Matej Horník.

The model leverages a pre-trained **Wav2Vec2 (Base)** encoder [`facebook/wav2vec2-base-en-voxpopuli-v2`](https://huggingface.co/facebook/wav2vec2-base-en-voxpopuli-v2) and a pre-trained **BART (Base)** decoder [`facebook/bart-base`](https://huggingface.co/facebook/bart-base).

## Thesis Context

This model is a direct result of work conducted for the Master's thesis:

*   **Title:** Effective Training of Neural Networks for Automatic Speech Recognition
*   **Author:** Matej Horník
*   **Supervisor:** Ing. Alexander Polok
*   **Institution:** Brno University of Technology, Faculty of Information Technology
*   **Year:** 2025
*   **Thesis Link:** [Link to thesis PDF](https://www.vut.cz/en/students/final-thesis/detail/164401)

> [!NOTE] 
> Link will be available after the thesis defense.

### Thesis Abstract (English)
This master's thesis focuses on improving the training efficiency and performance of encoder-decoder transformer models for Automatic Speech Recognition (ASR). It investigates the impact of initialization strategies using pre-trained components (Wav2Vec2, BART), the role of convolutional adapters, and Parameter-Efficient Fine-tuning (PEFT) methods like LoRA and DoRA. Experiments on LibriSpeech and VoxPopuli datasets confirmed that full pre-trained initialization is crucial for best Word Error Rate (WER) and convergence. An optimal number of adapters improved performance, while PEFT (especially LoRA) significantly reduced trainable parameters with comparable accuracy. Domain-specific encoder pre-training proved beneficial, and the encoder-decoder model outperformed a CTC baseline in accuracy, offering practical insights for efficient ASR training.

## Model Details

*   **Encoder:** `facebook/wav2vec2-base-en-voxpopuli-v2`. This is a Wav2Vec2 (Base) model pre-trained by Facebook on 24.1k hours of unlabeled English VoxPopuli data.
*   **Decoder:** `facebook/bart-base`. This is a standard BART (Base) model.
*   **Architecture:** `SpeechEncoderDecoderModel` from Hugging Face Transformers.
*   **Adapters:** 3 convolutional adapter layers were added to the encoder's output to better align its temporal resolution with the BART decoder's input requirements.
*   **Feature Extractor:** The Wav2Vec2 feature extractor (initial CNN layers) was kept frozen during fine-tuning, as experiments showed this maintained performance while reducing trainable parameters.
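
To make the effect of the adapters on temporal resolution concrete, the following sketch counts encoder output frames for one second of 16 kHz audio. It assumes the standard Wav2Vec2 (Base) convolutional layout and the Transformers default adapter stride of 2; neither value is stated in this card, so treat the numbers as illustrative.

```python
# Standard Wav2Vec2 (Base) feature-extractor layout: (kernel, stride) per CNN layer.
# Assumption: this checkpoint uses the usual Base configuration.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
ADAPTER_STRIDE = 2  # assumption: the Transformers default, not stated in this card


def encoder_frames(num_samples: int, num_adapter_layers: int = 3) -> int:
    """Return the number of encoder output frames for a raw waveform length."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Unpadded 1D convolution output length.
        length = (length - kernel) // stride + 1
    for _ in range(num_adapter_layers):
        # Padded adapter convolution: the sequence shrinks by the stride.
        length = (length - 1) // ADAPTER_STRIDE + 1
    return length


one_second = 16_000  # samples at 16 kHz
print(encoder_frames(one_second, num_adapter_layers=0))  # 49
print(encoder_frames(one_second, num_adapter_layers=3))  # 7
```

Under these assumptions, three adapter layers shorten the encoder output roughly eightfold, which brings the sequence length closer to the token-level granularity the BART decoder attends over.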

### Initial Model Construction
The base model (before fine-tuning for this specific result) was constructed by combining the pre-trained `facebook/wav2vec2-base-en-voxpopuli-v2` (encoder) and `facebook/bart-base` (decoder) using `SpeechEncoderDecoderModel.from_encoder_decoder_pretrained`. The construction code is provided in [create_model.py](create_model.py) and can be run with:

```bash
python create_model.py
```
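
For orientation, a minimal sketch of what such a construction step might look like is given below. It is not the contents of `create_model.py`; the helper name `build_model` and the way the special token IDs are copied from the decoder config are assumptions modelled on the Hugging Face example scripts. Calling the function downloads both checkpoints.

```python
ENCODER_ID = "facebook/wav2vec2-base-en-voxpopuli-v2"
DECODER_ID = "facebook/bart-base"
NUM_ADAPTER_LAYERS = 3  # as listed under Model Details


def build_model():
    """Combine the pre-trained encoder and decoder (hypothetical helper)."""
    # Imported lazily so the heavy dependency loads only when a model is built.
    from transformers import SpeechEncoderDecoderModel

    model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
        ENCODER_ID,
        DECODER_ID,
        encoder_add_adapter=True,  # append convolutional adapter layers
        encoder_num_adapter_layers=NUM_ADAPTER_LAYERS,
    )
    # Generation needs to know how to start, pad, and end decoder sequences.
    # Assumption: the IDs are taken from the BART decoder config.
    model.config.decoder_start_token_id = model.decoder.config.bos_token_id
    model.config.pad_token_id = model.decoder.config.pad_token_id
    model.config.eos_token_id = model.decoder.config.eos_token_id
    return model
```

Keyword arguments prefixed with `encoder_` are forwarded to the encoder configuration, which is how the adapter settings reach the Wav2Vec2 model.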


## Training Data

### Data
The model was fine-tuned on the `train` split of the English portion of the [VoxPopuli dataset](https://huggingface.co/datasets/facebook/voxpopuli) (`facebook/voxpopuli`, config `en`).
Audio data was resampled to 16 kHz. Transcriptions were lowercased and tokenized with the BART tokenizer.

### Procedure
The model was fine-tuned using a modified [`run_speech_recognition_seq2seq.py`](https://github.com/hornikmatej/thesis_mit/blob/main/run_speech_recognition_seq2seq.py) script (provided in the thesis materials and based on Hugging Face's example scripts).

**Key Hyperparameters:**
* **Optimizer:** AdamW
* **Learning Rate:** `1e-4`
* **LR Scheduler:** `cosine_with_min_lr` (min\_lr: `5e-9`)
* **Warmup Steps:** 2000
* **Batch Size (per device):** 96
* **Gradient Accumulation Steps:** 1
* **Number of Epochs:** 20
* **Weight Decay:** 0.01
* **Label Smoothing Factor:** 0.05
* **Mixed Precision:** bf16
* **SpecAugment:** Applied during training
    * `mask_time_prob`: 0.25, `mask_time_length`: 30, `mask_time_min_masks`: 2
    * `mask_feature_prob`: 0.3, `mask_feature_length`: 30, `mask_feature_min_masks`: 1
* **Feature Extractor:** Frozen
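
The learning-rate settings above can be read as a linear warmup followed by a cosine decay that bottoms out at the minimum rate. The sketch below reproduces that shape in plain Python; it mirrors the general behaviour of a cosine schedule with a minimum learning rate rather than the exact scheduler code, and the total step count in the usage example is a hypothetical value, since the card does not state it.

```python
import math

PEAK_LR = 1e-4       # learning rate from the list above
MIN_LR = 5e-9        # min_lr from the list above
WARMUP_STEPS = 2000  # warmup steps from the list above


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (illustrative sketch)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Rescale the cosine factor so it decays to MIN_LR instead of 0.
    min_rate = MIN_LR / PEAK_LR
    return PEAK_LR * (cosine * (1.0 - min_rate) + min_rate)


TOTAL_STEPS = 20_000  # hypothetical, for illustration only
for step in (0, 1_000, 2_000, 11_000, 20_000):
    print(step, lr_at(step, TOTAL_STEPS))
```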

The full training command can be found in the [thesis materials](https://github.com/hornikmatej/thesis_mit/blob/main/run_scripts/voxpopuli_best.sh), including the specific arguments used.


## Evaluation

The model achieves the following Word Error Rate (WER) on the VoxPopuli English dataset:

| Dataset Split | WER (%) | Loss  |
|---------------|---------|-------|
| Validation    | 8.55    | 1.056 |
| Test          | 8.85    | 1.076 |
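
As a reminder of what the metric measures, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below computes it for a single utterance in plain Python; it is an illustration only, not the evaluation code used for the numbers above, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("we support this proposal", "we support the proposal")
print(f"{100 * wer:.2f}%")  # 25.00%
```

Corpus-level WER, as usually reported, sums the errors and the reference word counts over all utterances before dividing, rather than averaging per-utterance values.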


For detailed training logs, metrics, and visualizations, please refer to the Weights & Biases report:

[![Weights & Biases report](https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg)](https://api.wandb.ai/links/xhorni20-fitvut/2018dikj)

## How to Use

You can use this model for inference with the Hugging Face `transformers` library.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hornikmatej/thesis_mit/blob/main/graphs/colab_ntb.ipynb)

```python
from transformers import SpeechEncoderDecoderModel, AutoProcessor
import torch
from datasets import load_dataset

MODEL_ID = "matejhornik/wav2vec2-base_bart-base_voxpopuli-en"
DATASET_ID = "facebook/voxpopuli"
DATASET_CONFIG = "en"
DATASET_SPLIT = "test" # "validation"

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = SpeechEncoderDecoderModel.from_pretrained(MODEL_ID).to(device)

# Both literals must be f-strings, otherwise the placeholders print verbatim.
print(f"Using device: {device}\nStreaming one sample from '{DATASET_ID}' "
      f"(config: '{DATASET_CONFIG}', split: '{DATASET_SPLIT}')...")
streamed_dataset = load_dataset(
    DATASET_ID,
    DATASET_CONFIG,
    split=DATASET_SPLIT,
    streaming=True,
)
sample = next(iter(streamed_dataset))

audio_input = sample["audio"]["array"]
input_sampling_rate = sample["audio"]["sampling_rate"]

inputs = processor(audio_input, sampling_rate=input_sampling_rate, return_tensors="pt", padding=True)
input_features = inputs.input_values.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features, max_length=128)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"\nOriginal: {sample['normalized_text']}")
print(f"Transcribed: {transcription}")
```



### Framework Versions

This model was trained using:
- Python: `^3.10`
- Transformers: `~4.46.3`
- PyTorch: `~2.5.1`
- Datasets: `^3.2.0`
- PEFT: `^0.14.0`
- Accelerate: `^1.4.0`
- Evaluate: `^0.4.3`
- WandB: `^0.19.7`

Visit the [pyproject.toml](https://github.com/hornikmatej/thesis_mit/blob/main/pyproject.toml) file for a complete list of dependencies.

## Citation
If you use this model or findings from the thesis, please cite:

[![CITE](https://excel.fit.vutbr.cz/wp-content/images/2023/FIT_color_CMYK_EN.svg)](https://www.vut.cz/en/students/final-thesis/detail/164401)

```bibtex
@mastersthesis{Hornik2025EffectiveTraining,
  author       = {Horník, Matej},
  title        = {Effective Training of Neural Networks for Automatic Speech Recognition},
  school       = {Brno University of Technology, Faculty of Information Technology},
  year         = {2025},
  supervisor   = {Polok, Alexander},
  type         = {Master's Thesis},
  note         = {Online. Available at: \url{https://www.vut.cz/en/students/final-thesis/detail/164401}}
}
```

## Acknowledgements
- My supervisor, Ing. Alexander Polok, for his valuable guidance and support.
- The Hugging Face team for their comprehensive transformers, datasets, and evaluate libraries.
- The creators of Wav2Vec2, BART, and the VoxPopuli dataset.

## Contact 
For questions, feedback, or collaboration opportunities related to this thesis or other topics, feel free to reach out:

- **Email:** [email protected] / [email protected]
- **GitHub:** [hornikmatej](https://github.com/hornikmatej)