---
license: apache-2.0
tags:
- generated_from_trainer
- automatic_speech_recognition
- asr
- nlp
- speech_to_text
- low_resource
metrics:
- wer
base_model: facebook/wav2vec2-large-xlsr-53
model-index:
- name: pidgin-wav2vec2-xlsr53
  results: []
datasets:
- asr-nigerian-pidgin/nigerian-pidgin-1.0
pipeline_tag: automatic-speech-recognition
library_name: transformers
---


# pidgin-wav2vec2-xlsr53

This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) for transcribing Nigerian Pidgin English. Building on the self-supervised, cross-lingual representations of XLSR-53, it was fine-tuned on the [Nigerian Pidgin dataset](https://huggingface.co/datasets/asr-nigerian-pidgin/nigerian-pidgin-1.0) to handle the phonetic and lexical features specific to Nigerian Pidgin, offering significant improvements over zero-shot ASR baselines.


It achieves the following results on the validation set at the final checkpoint (step 10000):
- Loss: 0.6907
- WER: 0.3161

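For readers less familiar with the metric: word error rate (WER) is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. A WER of 0.3161 therefore corresponds to roughly 32 word errors per 100 reference words. The sketch below is a minimal pure-Python illustration, not the evaluation code used for this model; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) sentences: one of five reference words is dropped.
print(word_error_rate("how you dey my friend", "how you dey friend"))  # 0.2
# An empty transcript counts every reference word as an error.
print(word_error_rate("how you dey", ""))  # 1.0
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.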

## Intended uses & limitations

**Intended Use**: Automatic speech recognition (ASR) on Nigerian Pidgin audio, such as speech-to-text conversion and related downstream tasks. Performance is best on clean recordings with limited background noise.

**Limitations/Caveats**:

- Trained exclusively on speech from a limited set of demographic groups; may underperform on dialects or accents not represented in the training set.
- Struggles with numeric phrases and unusual phonetic variants, as noted in qualitative evaluations [see here].
- Struggles with noisy environments and fast-paced speech.
- Not suited to domains that demand very high accuracy (e.g., legal or medical transcription) without further tuning.

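For the intended use described above, transcription with the Transformers ASR pipeline might look like the following sketch. The repository id in `MODEL_ID` is an assumption inferred from the model name and the dataset organization, and the `transcribe` helper is hypothetical; substitute the actual repository id or a local checkpoint path.

```python
# ASSUMPTION: this repo id is a guess based on the model name in this card;
# replace it with the real Hub id or a local checkpoint directory.
MODEL_ID = "asr-nigerian-pidgin/pidgin-wav2vec2-xlsr53"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe a single audio file and return the predicted text."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # When given a file path, the pipeline decodes the audio (this needs
    # ffmpeg) and resamples it to the rate the feature extractor expects.
    result = asr(audio_path)
    return result["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a placeholder for your own recording, downloads the checkpoint on first use. Given the limitations listed above, clean, moderately paced speech is likely to give the most reliable transcripts.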
## Training and evaluation data

*to be updated*

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- seed: 3407
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 30
- mixed_precision_training: Native AMP

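Two of these settings interact in ways that are easy to misread, so here is a small sketch. The total batch size of 8 is the per-device batch size (4) times the gradient accumulation steps (2), which is consistent with training on a single device. With the linear scheduler, the learning rate rises from 0 to 1e-4 over the first 1000 steps and then decays linearly to 0. The total step count below is an assumption estimated from the step-to-epoch ratio in the training log, and the helper names are hypothetical.

```python
LEARNING_RATE = 1e-4
WARMUP_STEPS = 1000
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 2
# ASSUMPTION: about 338 optimizer steps per epoch (10000 steps at epoch
# 29.54 in the training log) times 30 epochs. The exact value is not
# stated in this card.
TOTAL_STEPS = 10_140


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def linear_lr(step: int, peak: float = LEARNING_RATE,
              warmup: int = WARMUP_STEPS, total: int = TOTAL_STEPS) -> float:
    """Linear warmup to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * max(0.0, (total - step) / max(1, total - warmup))


print(effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))  # 8
for step in (0, 500, 1000, 5000, TOTAL_STEPS):
    print(step, linear_lr(step))
```

Under these assumptions the peak learning rate is reached at about epoch 3, which matches the point in the training log where WER first drops below 1.0.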
### Training results

| Training Loss | Epoch | Step  | Validation Loss | WER    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 6.604         | 1.48  | 500   | 3.0540          | 1.0    |
| 3.0176        | 2.95  | 1000  | 3.0035          | 1.0    |
| 2.1071        | 4.43  | 1500  | 1.0811          | 0.6289 |
| 1.1143        | 5.91  | 2000  | 0.8348          | 0.5017 |
| 0.8501        | 7.39  | 2500  | 0.7707          | 0.4352 |
| 0.7272        | 8.86  | 3000  | 0.7410          | 0.4075 |
| 0.6038        | 10.34 | 3500  | 0.6283          | 0.3850 |
| 0.5334        | 11.82 | 4000  | 0.6356          | 0.3701 |
| 0.4645        | 13.29 | 4500  | 0.6243          | 0.3657 |
| 0.4251        | 14.77 | 5000  | 0.6838          | 0.3492 |
| 0.3801        | 16.25 | 5500  | 0.6619          | 0.3445 |
| 0.3636        | 17.73 | 6000  | 0.6945          | 0.3360 |
| 0.3366        | 19.2  | 6500  | 0.6108          | 0.3340 |
| 0.3146        | 20.68 | 7000  | 0.6511          | 0.3273 |
| 0.3003        | 22.16 | 7500  | 0.6815          | 0.3253 |
| 0.2783        | 23.63 | 8000  | 0.6761          | 0.3215 |
| 0.2601        | 25.11 | 8500  | 0.6762          | 0.3187 |
| 0.2528        | 26.59 | 9000  | 0.6687          | 0.3194 |
| 0.2409        | 28.06 | 9500  | 0.7064          | 0.3163 |
| 0.2359        | 29.54 | 10000 | 0.6907          | 0.3161 |

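The log shows that validation loss and WER do not favor the same checkpoint: validation loss is lowest at step 6500, while WER keeps improving until the final step. The sketch below simply reads that off the values copied from the table; which criterion to prefer is a judgment call, and the variable names are illustrative.

```python
# (step, validation_loss, wer) copied from the training results table.
CHECKPOINTS = [
    (500, 3.0540, 1.0), (1000, 3.0035, 1.0), (1500, 1.0811, 0.6289),
    (2000, 0.8348, 0.5017), (2500, 0.7707, 0.4352), (3000, 0.7410, 0.4075),
    (3500, 0.6283, 0.3850), (4000, 0.6356, 0.3701), (4500, 0.6243, 0.3657),
    (5000, 0.6838, 0.3492), (5500, 0.6619, 0.3445), (6000, 0.6945, 0.3360),
    (6500, 0.6108, 0.3340), (7000, 0.6511, 0.3273), (7500, 0.6815, 0.3253),
    (8000, 0.6761, 0.3215), (8500, 0.6762, 0.3187), (9000, 0.6687, 0.3194),
    (9500, 0.7064, 0.3163), (10000, 0.6907, 0.3161),
]

best_by_loss = min(CHECKPOINTS, key=lambda row: row[1])
best_by_wer = min(CHECKPOINTS, key=lambda row: row[2])

print("lowest validation loss:", best_by_loss)  # (6500, 0.6108, 0.334)
print("lowest WER:", best_by_wer)               # (10000, 0.6907, 0.3161)
```

Since WER is the metric reported for this model, the final checkpoint is the one summarized at the top of this card.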

### Framework versions

- Transformers 4.37.2
- Pytorch 2.0.1+cu117
- Datasets 2.12.0
- Tokenizers 0.15.2