File size: 5,068 Bytes
667e36a c0d2484 667e36a c0d2484 667e36a c0d2484 667e36a 94da294 667e36a e8d7357 b85b2a7 667e36a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 |
---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/kunkado
- RobotsMali/bam-asr-early
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: nvidia/parakeet-ctc-0.6b
model-index:
- name: soloba-ctc-0.6b-v0
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: bam-asr-early
type: RobotsMali/bam-asr-early
split: test
args:
language: bm
metrics:
- name: Test WER
type: wer
value: 35.15760898590088
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
# Soloni TDT-CTC 114M Bambara
<style>
img {
display: inline;
}
</style>
[](#model-architecture)
| [](#model-architecture)
| [](#datasets)
`soloba-ctc-0.6b-v0` is a fine tuned version of [`nvidia/parakeet-ctc-0.6b`](https://huggingface.co/nvidia/parakeet-ctc-0.6b) on RobotsMali/kunkado and RobotsMali/bam-asr-early. This model cannot does produce Capitalizations but not Punctuations. The model was fine-tuned using **NVIDIA NeMo**.
The model doesn't tag code swicthed expressions in its transcription since for training this model we decided to treat them as a modern variant of the Bambara Language removing all tags and markages.
## **🚨 Important Note**
This model, along with its associated resources, is part of an **ongoing research effort**, improvements and refinements are expected in future versions. A human evaluation report of the model is coming soon. Users should be aware that:
- **The model may not generalize very well accross all speaking conditions and dialects.**
- **Community feedback is welcome, and contributions are encouraged to refine the model further.**
## NVIDIA NeMo: Training
To fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest PyTorch version.
```bash
pip install nemo_toolkit['asr']
```
## How to Use This Model
Note that this model has been released for research purposes primarily.
### Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-ctc-0.6b-v0")
```
### Transcribe Audio
```python
model.eval()
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```
### Input
This model accepts any **mono-channel audio (wav files)** as input and resamples them to *16 kHz sample rate* before performing the forward pass
### Output
This model provides transcribed speech as a string for a given speech sample and return an Hypothesis object (under nemo>=2.3)
## Model Architecture
This model uses a FastConformer Ecoder and a CTC decoder. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer). And a Convolutional Neural Net with CTC loss, the ***Connectionist Temporal Classification*** decoder
## Training
The NeMo toolkit (version 2.3.0) was used for finetuning this model for **183,086 steps** over `nvidia/parakeet-ctc-0.6b` model. This version is trained with this [base config](https://github.com/diarray-hub/bambara-asr/blob/main/kunkado-training/config/soloba/soloba-ctc-v0.0.0.yaml). The full training configurations, scripts, and experimental logs are available here:
🔗 [Bambara-ASR Experiments](https://github.com/diarray-hub/bambara-asr)
The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
## Dataset
This model was fine-tuned on the [kunkado](https://huggingface.co/datasets/RobotsMali/kunkado) dataset, the semi-labelled subset, which consists of **~120 hours of automatically annotated Bambara speech data**, and the [bam-asr-early](https://huggingface.co/datasets/RobotsMali/bam-asr-early) dataset.
## Performance
We report the Word Error Rate on the test set of bam-asr-early.
|**Decoder (Version)**|**Tokenizer**|**Vocabulary Size**|**bam-asr-early**|
|---------|-----------------------|-----------------|---------|
| v0 | BPE | 512 | 35.16 |
## License
This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.
---
Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/diarray-hub/bambara-asr/issues) on github if you have any contributions
---
|