soloba-ctc-0.6b-v0


soloba-ctc-0.6b-v0 is a fine-tuned version of nvidia/parakeet-ctc-0.6b on RobotsMali/kunkado and RobotsMali/bam-asr-early. The model produces capitalization but not punctuation. It was fine-tuned using NVIDIA NeMo.

The model doesn't tag code-switched expressions in its transcriptions. For training, we decided to treat them as a modern variant of the Bambara language and removed all tags and markup.

🚨 Important Note

This model, along with its associated resources, is part of an ongoing research effort; improvements and refinements are expected in future versions. A human evaluation report of the model is coming soon. Users should be aware that:

The model may not generalize very well across all speaking conditions and dialects.
Community feedback is welcome, and contributions are encouraged to refine the model further.

NVIDIA NeMo: Training

To fine-tune or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

pip install "nemo_toolkit[asr]"

How to Use This Model

Note that this model has been released for research purposes primarily.

Load Model with NeMo

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-ctc-0.6b-v0")

Transcribe Audio

asr_model.eval()
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])

Input

This model accepts mono-channel audio (WAV files) as input and resamples it to a 16 kHz sample rate before performing the forward pass.
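Since the model expects mono audio, a stereo recording has to be downmixed before transcription. The following is a minimal sketch using only the Python standard library. It assumes 16-bit PCM WAV input, and `ensure_mono_wav` is a hypothetical helper name, not part of NeMo. The sample rate is left untouched because, as stated above, resampling to 16 kHz is handled by the model.

```python
import array
import sys
import wave


def ensure_mono_wav(src_path: str, dst_path: str) -> int:
    """Write a mono copy of a 16-bit PCM WAV file and return its sample rate.

    Hypothetical helper for illustration; assumes 16-bit PCM samples.
    """
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        sample_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    if n_channels > 1:
        # Samples are interleaved (L, R, L, R, ...); average each frame.
        mono = array.array(
            "h",
            (
                sum(samples[i : i + n_channels]) // n_channels
                for i in range(0, len(samples), n_channels)
            ),
        )
    else:
        mono = samples

    if sys.byteorder == "big":
        mono.byteswap()  # back to little-endian before writing

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(sample_rate)
        dst.writeframes(mono.tobytes())

    return sample_rate
```

The resulting file can then be passed to `asr_model.transcribe([...])` like any other mono recording.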

Output

This model provides the transcribed speech as a string for a given speech sample and returns a Hypothesis object (under nemo>=2.3).
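Because the return type depends on the NeMo version, downstream code may want to accept both plain strings and Hypothesis objects. The sketch below is one way to do that, relying only on the `text` attribute of a Hypothesis. `to_text`, `to_texts`, and `FakeHypothesis` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, List


def to_text(result: Any) -> str:
    """Return the transcript from either a plain string or an object
    exposing a `.text` attribute (as a NeMo Hypothesis does)."""
    if isinstance(result, str):
        return result
    return result.text


def to_texts(results: List[Any]) -> List[str]:
    """Normalize a list of transcription results to plain strings."""
    return [to_text(r) for r in results]


@dataclass
class FakeHypothesis:
    """Stand-in for a NeMo Hypothesis, used only to demo the helper
    without loading the model."""
    text: str


if __name__ == "__main__":
    mixed = [FakeHypothesis(text="first transcript"), "second transcript"]
    print(to_texts(mixed))
```

With the real model, the same helper would be applied as `to_texts(asr_model.transcribe(['sample_audio.wav']))`.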

Model Architecture

This model uses a FastConformer encoder and a CTC decoder. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You may find more information on the details of FastConformer here: Fast-Conformer Model. The decoder is a convolutional neural network trained with the Connectionist Temporal Classification (CTC) loss.
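To make the two ideas above concrete, here is a small, self-contained sketch. It is not NeMo's implementation. The first function estimates how many encoder frames an utterance yields after 8x downsampling, assuming a 10 ms feature hop (an assumption, not stated in this card; the exact count in NeMo may differ slightly because of padding). The second shows the standard greedy CTC rule: take the best token per frame, merge consecutive repeats, then drop blanks. The blank index is passed in explicitly because it depends on the tokenizer.

```python
import math
from typing import List, Sequence


def approx_encoder_frames(
    num_samples: int,
    sample_rate: int = 16000,
    hop_seconds: float = 0.01,  # assumed feature hop, not from the card
    downsampling: int = 8,
) -> int:
    """Approximate number of encoder output frames for an utterance."""
    hop_samples = int(sample_rate * hop_seconds)
    feature_frames = num_samples // hop_samples
    return math.ceil(feature_frames / downsampling)


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int) -> List[int]:
    """Merge consecutive repeated ids, then remove blanks."""
    collapsed: List[int] = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


if __name__ == "__main__":
    # 10 seconds of 16 kHz audio under the assumptions above.
    print(approx_encoder_frames(160_000))
    # Toy per-frame argmax ids with blank_id=4; the blank between the
    # two 1s keeps them as separate tokens.
    print(ctc_greedy_collapse([4, 1, 1, 4, 1, 2, 2, 4], blank_id=4))
```

Under these assumptions each encoder frame covers roughly 80 ms of audio, which is why a blank frame between two identical tokens is needed to emit the same token twice in a row.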

Training

The NeMo toolkit (version 2.3.0) was used to fine-tune this model for 183,086 steps, starting from the nvidia/parakeet-ctc-0.6b model. This version was trained with this base config. The full training configurations, scripts, and experimental logs are available here:

🔗 Bambara-ASR Experiments

The tokenizers for these models were built using the text transcripts of the train set with this script.

Dataset

This model was fine-tuned on the semi-labelled subset of the kunkado dataset, which consists of ~120 hours of automatically annotated Bambara speech data, and on the bam-asr-early dataset.

Performance

We report the Word Error Rate on the test set of bam-asr-early.

Decoder (Version)	Tokenizer	Vocabulary Size	bam-asr-all (WER %)
v0	BPE	512	35.16
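For readers who want to reproduce this kind of score on their own transcripts, the sketch below computes a corpus-level Word Error Rate in percent: total word-level edit distance divided by total reference words. It is a generic illustration with hypothetical function names, not the evaluation script used for the number above, and it applies no text normalization.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER in percent over (reference, hypothesis) pairs."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += word_edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_errors / total_words


if __name__ == "__main__":
    # Placeholder tokens, not real Bambara transcripts.
    pairs = [("a b c d", "a x c"), ("e f", "e f")]
    print(round(corpus_wer(pairs), 2))
```

Summing errors over the whole test set before dividing weights long utterances more heavily than averaging per-utterance WER would.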

License

This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.

Feel free to open a discussion on Hugging Face or file an issue on GitHub if you have any contributions.
