asr-nigerian-pidgin/pidgin-wav2vec2-xlsr53

@@ -16,7 +16,6 @@ model-index:
 datasets:
 - asr-nigerian-pidgin/nigerian-pidgin-1.0
 pipeline_tag: automatic-speech-recognition
-library_name: transformers
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -24,36 +23,41 @@ should probably proofread and complete it, then remove this comment. -->
 # pidgin-wav2vec2-xlsr53
-This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53), adapted for transcribing Nigerian Pidgin English. Building on the self-supervised, cross-lingual representations of XLSR-53, it has been trained using the [Nigerian Pidgin dataset](https://huggingface.co/datasets/asr-nigerian-pidgin/nigerian-pidgin-1.0) to handle the phonetic and lexical nuances unique to Nigerian Pidgin, offering significant improvements over zero-shot ASR baselines
 It achieves the following results on the evaluation set:
 - Loss: 0.6907
 - Wer: 0.3161 (val)
-## Intended uses & limitations
-**Intended Use**: Best suited for automatic speech recognition (ASR) tasks on Nigerian Pidgin audio, such as speech-to-text conversion and related downstream tasks. Best performance is achieved in a clean recording environments with limited background noise.
-**Limitations/Caveats**:
-- Trained exclusively on speech from limited demographic groups; may underperform on dialects or accents outside the training set.
-- Struggles with numeric phrases and unusual phonetic variants, as noted in qualitative evaluations [see here]
-- Struggles with noisy environment and fast-paced speech
-- Not suited for critically high-accuracy domains (e.g., legal, medical domain) without further tuning.
 ## Training and evaluation data
-*to be updated*
 ## Training procedure
 ### Training hyperparameters
 The following hyperparameters were used during training:
-- learning_rate: 0.0001
 - train_batch_size: 4
 - eval_batch_size: 4
 - seed: 3407
@@ -65,7 +69,15 @@ The following hyperparameters were used during training:
 - num_epochs: 30
 - mixed_precision_training: Native AMP
-### Training results
 | Training Loss | Epoch | Step  | Validation Loss | Wer    |
 |:-------------:|:-----:|:-----:|:---------------:|:------:|
@@ -93,7 +105,7 @@ The following hyperparameters were used during training:
 ### Framework versions
-- Transformers 4.37.2
 - Pytorch 2.0.1+cu117
-- Datasets 2.12.0
 - Tokenizers 0.15.2

 datasets:
 - asr-nigerian-pidgin/nigerian-pidgin-1.0
 pipeline_tag: automatic-speech-recognition
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 # pidgin-wav2vec2-xlsr53
+This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the [Nigerian Pidgin](https://huggingface.co/datasets/asr-nigerian-pidgin/nigerian-pidgin-1.0) dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.6907
 - Wer: 0.3161 (val)
+## Model description
+*to be updated*
+## Intended uses & limitations
+**Intended Uses**:
+- Best suited for automatic speech recognition (ASR) tasks on Nigerian Pidgin audio, such as speech-to-text conversion and related downstream tasks.
+- Academic research on low-resource and creole language ASR.
+**Known Limitations**:
+- Performance may degrade with dialectal variation, heavy code-switching, or noisy audio environments.
+- The model reflects biases present in the training dataset, which may reduce accuracy for underrepresented demographics, phonetic variations, or topics.
+- May struggle with rare words, numerals, and domain-specific terminology not well represented in the training set.
+- Not recommended for high-stakes domains (e.g., legal, medical) without domain-specific retraining or fine-tuning.
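
For the intended use described above, a minimal transcription sketch with the `transformers` ASR pipeline might look like the following. The model ID is an assumption taken from this repository's name, `transcribe` is a hypothetical helper, and passing a file path assumes ffmpeg is available for decoding.

```python
# Assumed from the repository name; adjust if the checkpoint lives elsewhere.
MODEL_ID = "asr-nigerian-pidgin/pidgin-wav2vec2-xlsr53"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # Given a file path, the pipeline decodes and resamples the audio itself.
    return asr(audio_path)["text"]
```

Calling `transcribe("sample.wav")` (a hypothetical file) downloads the checkpoint on first use, so expect the first call to be slow.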
 ## Training and evaluation data
+The model was fine-tuned on the [Nigerian Pidgin ASR v1.0 dataset](https://huggingface.co/datasets/asr-nigerian-pidgin/nigerian-pidgin-1.0), consisting of over 4,200 utterances recorded by 10 native speakers (balanced across gender and age) using the LIG-Aikuma mobile platform. Recordings were collected in controlled environments to ensure high-quality audio.
+Performance: WER of 7.4% (train), 31.6% (validation), and 29.6% (test), outperforming baselines such as QuartzNet and zero-shot XLSR. These results demonstrate the effectiveness of targeted fine-tuning for low-resource ASR.
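
The card does not say which tool produced these WER figures. As a reference for how the metric is usually defined, here is a small sketch that computes word-level edit distance divided by the number of reference words. The example sentences in the usage note are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For instance, `word_error_rate("how you dey today", "how you de today")` gives 0.25: one substituted word out of four reference words.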
 ## Training procedure
+We fine-tuned the facebook/wav2vec2-large-xlsr-53 model using the Nigerian Pidgin ASR dataset, following the methodology outlined in the XLSR-53 paper. Training was performed on a single NVIDIA A100 GPU using the Hugging Face transformers library with fp16 mixed precision to accelerate computation and reduce memory usage.
+A key modification from the standard setup was unfreezing the feature encoder during fine-tuning. This adjustment yielded improved performance, lowering word error rates (WER) on both validation and test sets compared to the frozen-encoder approach.
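
The training script itself is not part of this card. In `transformers`, the frozen setup is typically obtained by calling `freeze_feature_encoder()` on a `Wav2Vec2ForCTC` model, and leaving that call out keeps the encoder trainable. The sketch below makes the toggle explicit; `set_feature_encoder_trainable` is a hypothetical helper, and it assumes the encoder is reachable at `model.wav2vec2.feature_extractor`, as in the `Wav2Vec2ForCTC` layout.

```python
def set_feature_encoder_trainable(model, trainable: bool) -> int:
    """Enable or disable gradients for the convolutional feature encoder.

    Assumes `model` follows the Wav2Vec2ForCTC layout, where the encoder
    lives at `model.wav2vec2.feature_extractor`.
    Returns the number of parameter tensors that were updated.
    """
    count = 0
    for param in model.wav2vec2.feature_extractor.parameters():
        param.requires_grad = trainable
        count += 1
    return count
```

With a loaded model, `set_feature_encoder_trainable(model, True)` would correspond to the unfrozen configuration described above, and `False` to the frozen one.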
 ### Training hyperparameters
 The following hyperparameters were used during training:
+- learning_rate: 1e-4
 - train_batch_size: 4
 - eval_batch_size: 4
 - seed: 3407
 - num_epochs: 30
 - mixed_precision_training: Native AMP
+This configuration balanced training stability, efficiency, and accuracy, allowing the model to adapt effectively to Nigerian Pidgin speech patterns despite the dataset’s limited size.
+### Performance comparison: frozen vs. unfrozen encoder
+| Encoder State | Val WER | Test WER |
+| ------------- | ------- | -------- |
+| Frozen        | 0.332   |   0.436  |
+| Unfrozen      | 0.3161  |   0.296  |
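
To put the table in relative terms, the short calculation below derives the relative WER reduction from the four values above. The helper name `relative_reduction` is illustrative only.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error rate."""
    return (baseline - improved) / baseline


# WER values copied from the table above.
wer = {
    "frozen": {"val": 0.332, "test": 0.436},
    "unfrozen": {"val": 0.3161, "test": 0.296},
}

for split in ("val", "test"):
    gain = relative_reduction(wer["frozen"][split], wer["unfrozen"][split])
    print(f"{split}: {gain:.1%} relative WER reduction")
```

This works out to roughly 4.8% on validation and 32.1% on test, so most of the benefit from unfreezing shows up on the test set.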
+### Training results (Unfrozen Model)
 | Training Loss | Epoch | Step  | Validation Loss | Wer    |
 |:-------------:|:-----:|:-----:|:---------------:|:------:|
 ### Framework versions
+- Transformers 4.48.2
 - Pytorch 2.0.1+cu117
+- Datasets 2.20.0
 - Tokenizers 0.15.2