---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/kunkado
- RobotsMali/bam-asr-early

thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- CTC
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: nvidia/parakeet-ctc-0.6b
model-index:
- name: soloba-ctc-0.6b-v0
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: bam-asr-early
      type: RobotsMali/bam-asr-early
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 35.15760898590088

metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# Soloba CTC 0.6B Bambara

<style>
img {
 display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--CTC-blue#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-0.6B-green#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-bm-orange#model-badge)](#datasets)

`soloba-ctc-0.6b-v0` is a fine-tuned version of [`nvidia/parakeet-ctc-0.6b`](https://huggingface.co/nvidia/parakeet-ctc-0.6b) on RobotsMali/kunkado and RobotsMali/bam-asr-early. The model produces capitalization but not punctuation. It was fine-tuned using **NVIDIA NeMo**.

The model does not tag code-switched expressions in its transcriptions: for training, we decided to treat them as part of a modern variant of the Bambara language and removed all tags and markup.

## **🚨 Important Note**  
This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. A human evaluation report of the model is coming soon. Users should be aware that:  

- **The model may not generalize well across all speaking conditions and dialects.**  
- **Community feedback is welcome, and contributions are encouraged to refine the model further.** 

## NVIDIA NeMo: Training

To fine-tune or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest version of PyTorch.

```bash
pip install nemo_toolkit['asr']
``` 

## How to Use This Model

Note that this model has been released primarily for research purposes.

### Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-ctc-0.6b-v0")
```

### Transcribe Audio
```python
# Put the model loaded above in evaluation mode before inference
asr_model.eval()
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```

### Input

This model accepts **mono-channel audio (wav files)** as input and resamples it to a *16 kHz sample rate* before performing the forward pass.
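
Before transcribing, it can help to confirm that a file is actually mono. The following is a minimal sketch using only Python's standard `wave` module; the helper names `describe_wav` and `is_mono` are hypothetical and are not part of NeMo, and the sketch only handles uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / float(sample_rate)
    return channels, sample_rate, duration


def is_mono(path):
    """True when the WAV file has exactly one channel."""
    channels, _, _ = describe_wav(path)
    return channels == 1
```

For example, `is_mono("sample_audio.wav")` could be checked before passing the file to `asr_model.transcribe`. Files that are not mono would need to be downmixed with an audio tool of your choice first.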

### Output

For a given speech sample, this model provides the transcribed speech as a string. Under nemo>=2.3, the result is returned as a Hypothesis object.
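
Because the return type depends on the NeMo version, downstream code may want to handle both cases. The sketch below assumes the Hypothesis object exposes the transcript through a `text` attribute; `to_text`, `texts_from_transcribe`, and `FakeHypothesis` are hypothetical names introduced here for illustration, and `FakeHypothesis` exists only so the example runs without NeMo installed.

```python
from dataclasses import dataclass


@dataclass
class FakeHypothesis:
    """Stand-in for a NeMo Hypothesis (assumption: it has a `text` field)."""
    text: str


def to_text(result):
    """Return the transcript for a plain string or an object with `.text`."""
    if isinstance(result, str):
        return result
    return result.text


def texts_from_transcribe(results):
    """Normalize a list of transcription results to a list of strings."""
    return [to_text(item) for item in results]
```

With a loaded model, this could be used as `texts_from_transcribe(asr_model.transcribe(['sample_audio.wav']))`.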

## Model Architecture

This model uses a FastConformer encoder and a CTC decoder. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You can find more details on FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer). The decoder is a convolutional neural network trained with the ***Connectionist Temporal Classification*** (CTC) loss.
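
To illustrate how a CTC decoder's per-frame predictions become a token sequence, here is a minimal sketch of the standard greedy CTC rule: merge consecutive repeats, then drop blanks. It is not NeMo's implementation; the function name and the `blank_id` parameter are assumptions made for this example.

```python
def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeated ids, then remove the blank id."""
    output = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            output.append(token_id)
        previous = token_id
    return output
```

Note that a blank between two identical ids keeps both of them, which is how CTC represents genuinely repeated tokens.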

## Training

The NeMo toolkit (version 2.3.0) was used to fine-tune this model for **183,086 steps**, starting from the `nvidia/parakeet-ctc-0.6b` model. This version was trained with this [base config](https://github.com/diarray-hub/bambara-asr/blob/main/kunkado-training/config/soloba/soloba-ctc-v0.0.0.yaml). The full training configurations, scripts, and experimental logs are available here:

🔗 [Bambara-ASR Experiments](https://github.com/diarray-hub/bambara-asr)

The tokenizer for this model was built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

## Dataset
This model was fine-tuned on the [kunkado](https://huggingface.co/datasets/RobotsMali/kunkado) dataset, the semi-labelled subset, which consists of **~120 hours of automatically annotated Bambara speech data**, and the [bam-asr-early](https://huggingface.co/datasets/RobotsMali/bam-asr-early) dataset.

## Performance

We report the Word Error Rate on the test set of bam-asr-early.

| **Decoder (Version)** | **Tokenizer** | **Vocabulary Size** | **bam-asr-early (WER %)** |
|-----------------------|---------------|---------------------|---------------------------|
| v0                    | BPE           | 512                 | 35.16                     |
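
For readers unfamiliar with the metric, the sketch below shows one common way to compute Word Error Rate for a single reference/hypothesis pair: the word-level edit distance divided by the number of reference words. It is an illustration only, not the evaluation script used to produce the figure above, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous_row[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref)
```

Multiplying the result by 100 gives a percentage. Corpus-level scores are usually computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.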


## License
This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.

---

Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/diarray-hub/bambara-asr/issues) on GitHub if you have any contributions.

---