|
--- |
|
language: |
|
- bm |
|
library_name: nemo |
|
datasets: |
|
- RobotsMali/kunkado |
|
|
|
thumbnail: null |
|
tags: |
|
- automatic-speech-recognition |
|
- speech |
|
- audio |
|
|
- FastConformer |
|
- Conformer |
|
- pytorch |
|
- Bambara |
|
- NeMo |
|
license: cc-by-4.0 |
|
base_model: RobotsMali/soloba-ctc-0.6b-v0 |
|
model-index: |
|
- name: soloba-ctc-0.6b-v1 |
|
results: |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: kunkado (human-reviewed) |
|
type: RobotsMali/kunkado |
|
split: test |
|
args: |
|
language: bm |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 44.78471577167511 |
|
|
|
metrics: |
|
- wer |
|
pipeline_tag: automatic-speech-recognition |
|
--- |
|
|
|
# Soloba CTC 0.6B Bambara
|
|
|
<style> |
|
img { |
|
display: inline; |
|
} |
|
</style> |
|
|
|
[](#model-architecture) |
|
| [](#model-architecture) |
|
| [](#datasets) |
|
|
|
`soloba-ctc-0.6b-v1` is a fine-tuned version of [`RobotsMali/soloba-ctc-0.6b-v0`](https://huggingface.co/RobotsMali/soloba-ctc-0.6b-v0) on [RobotsMali/kunkado](https://huggingface.co/datasets/RobotsMali/kunkado). This model produces capitalization but not punctuation. The model was fine-tuned using **NVIDIA NeMo**.
|
|
|
The model does not tag code-switched expressions in its transcriptions: for training, we decided to treat them as part of a modern variant of the Bambara language, removing all tags and markers.
|
|
|
## **🚨 Important Note** |
|
This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. A human evaluation report of the model is coming soon. Users should be aware that:
|
|
|
- **The model may not generalize very well across all speaking conditions and dialects.**
|
- **Community feedback is welcome, and contributions are encouraged to refine the model further.** |
|
|
|
## NVIDIA NeMo: Training |
|
|
|
To fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest version of PyTorch.
|
|
|
```bash |
|
pip install "nemo_toolkit[asr]"
|
``` |
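
After installing, you can confirm the toolkit version from Python. This is a quick sanity check; the model was fine-tuned with NeMo 2.3.0, and the `Hypothesis` output described below assumes `nemo>=2.3`:

```python
import nemo

# This model was fine-tuned with NeMo 2.3.0; nemo>=2.3 is assumed below
print(nemo.__version__)
```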
|
|
|
## How to Use This Model |
|
|
|
Note that this model has been released primarily for research purposes.
|
|
|
### Load Model with NeMo |
|
```python |
|
import nemo.collections.asr as nemo_asr |
|
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-ctc-0.6b-v1") |
|
``` |
|
|
|
### Transcribe Audio |
|
```python |
|
asr_model.eval()
|
# Assuming you have a test audio file named sample_audio.wav |
|
asr_model.transcribe(['sample_audio.wav']) |
|
``` |
|
|
|
### Input |
|
|
|
This model accepts **mono-channel WAV audio files** as input and resamples them to a *16 kHz sample rate* before performing the forward pass.
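
If your recordings are stereo or in another format, you can convert them to mono 16 kHz WAV beforehand. Below is a minimal sketch using `librosa` and `soundfile`, which are assumptions for illustration and not dependencies of this model (NeMo also resamples internally, as noted above):

```python
import librosa
import soundfile as sf

# Load any audio file, downmix to mono and resample to 16 kHz
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a 16 kHz mono WAV that the model can consume directly
sf.write("sample_audio.wav", audio, sr)
```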
|
|
|
### Output |
|
|
|
This model provides transcribed speech as a string for a given speech sample and returns a `Hypothesis` object (under `nemo>=2.3`).
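
A minimal sketch of reading the transcription out of the returned object, assuming `nemo>=2.3` where `transcribe` returns a list of `Hypothesis` objects:

```python
# transcribe() returns one Hypothesis per input file
hypotheses = asr_model.transcribe(['sample_audio.wav'])

# The transcribed string is stored in the .text attribute
print(hypotheses[0].text)
```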
|
|
|
## Model Architecture |
|
|
|
This model uses a FastConformer encoder and a CTC decoder. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling; you may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer). The decoder is a convolutional neural network trained with the ***Connectionist Temporal Classification*** (CTC) loss.
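
Once the model is loaded, you can inspect the encoder and decoder modules with standard PyTorch introspection. This is just a sanity check; exact class names may vary across NeMo versions:

```python
# The encoder is NeMo's Conformer/FastConformer implementation
print(type(asr_model.encoder).__name__)

# The decoder is the convolutional CTC head
print(type(asr_model.decoder).__name__)
```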
|
|
|
## Training |
|
|
|
The NeMo toolkit (version 2.3.0) was used to fine-tune this model for **162,445 steps** starting from the `RobotsMali/soloba-ctc-0.6b-v0` checkpoint. This version was trained with this [base config](https://github.com/diarray-hub/bambara-asr/blob/main/kunkado-training/config/soloba/soloba-ctc-v1.5.0.yaml). The full training configurations, scripts, and experimental logs are available here:
|
|
|
🔗 [Bambara-ASR Experiments](https://github.com/diarray-hub/bambara-asr) |
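
For reference, continued fine-tuning with NeMo generally looks like the sketch below. The manifest path, batch size, and trainer settings here are illustrative assumptions; the actual runs used the linked config and scripts:

```python
import lightning.pytorch as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

asr_model = nemo_asr.models.ASRModel.from_pretrained("RobotsMali/soloba-ctc-0.6b-v1")

# Hypothetical NeMo-style manifest (one JSON object per line with
# "audio_filepath", "duration" and "text" fields)
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": True,
})
asr_model.setup_training_data(train_cfg)

trainer = pl.Trainer(max_steps=1000, accelerator="auto")
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```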
|
|
|
The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). |
|
|
|
## Dataset |
|
This model was fine-tuned on the human-reviewed subset of the [kunkado](https://huggingface.co/datasets/RobotsMali/kunkado) dataset, which consists of **~40 hours of transcribed Bambara speech data**. The text was normalized with the [bambara-normalizer](https://pypi.org/project/bambara-normalizer/) prior to training, normalizing numbers and removing punctuation.
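
For illustration, punctuation removal of the kind applied during preprocessing can be sketched in plain Python. This is not the `bambara-normalizer` API (which also handles number normalization), just a hypothetical equivalent:

```python
import re

def strip_punctuation(text: str) -> str:
    # Remove punctuation marks while keeping letters (including Bambara
    # characters such as ɛ, ɔ, ɲ, ŋ), digits, and whitespace
    return re.sub(r"[^\w\s]", "", text)

print(strip_punctuation("I ni ce! I ka kɛnɛ wa?"))  # -> "I ni ce I ka kɛnɛ wa"
```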
|
|
|
## Performance |
|
|
|
We report the Word Error Rate (WER) on the test sets of bam-asr-early and kunkado.
|
|
|
|**Version**|**Tokenizer**|**Vocabulary Size**|**bam-asr-early WER (%)**|**kunkado WER (%)**|
|
|---------|-----------------------|-----------------|---------|---------| |
|
| v0 | BPE | 512 | 35.16 | - | |
|
| v1 | BPE | 512 | - | 44.78 | |
|
|
|
## License |
|
This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license. |
|
|
|
--- |
|
|
|
Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/diarray-hub/bambara-asr/issues) on GitHub if you have any contributions or feedback.
|
|
|
--- |
|
|
|
|