readme update

README.md (changed)

The YAML front matter picks up a `transformers` tag:

```diff
@@ -18,6 +18,7 @@ datasets:
 - MLCommons/peoples_speech
 thumbnail: null
 tags:
+- transformers
 - automatic-speech-recognition
 - speech
 - audio
```

It is an XL version of the FastConformer CTC [1] model (around 600M parameters).
See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.

## Transformers

You can now run Parakeet CTC natively with [Transformers](https://github.com/huggingface/transformers) 🤗:

```bash
pip install git+https://github.com/huggingface/transformers
```
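
If you want to confirm the source build took, a quick sanity check (source installs of Transformers report a `devN` suffix in the version string):

```python
import transformers

# A source install reports a development version, e.g. "4.57.0.dev0"
print(transformers.__version__)
```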

<details>
<summary>➡️ Pipeline usage</summary>

```python
from transformers import pipeline

# Build an ASR pipeline; the model and its processor are downloaded automatically
pipe = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-0.6b")

# Transcribe a remote audio file (local paths and raw arrays work too)
out = pipe("https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3")
print(out)
```
</details>
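
For recordings longer than a single model window, the generic ASR pipeline arguments for chunked inference should apply; a minimal sketch, assuming `chunk_length_s` and `batch_size` behave for this checkpoint as they do for other CTC models (the file name is a placeholder):

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-0.6b")

# Split the audio into 30-second windows and transcribe four windows
# per forward pass; "my_long_recording.mp3" is a hypothetical file.
out = pipe("my_long_recording.mp3", chunk_length_s=30, batch_size=4)
print(out["text"])
```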

<details>
<summary>➡️ AutoModel</summary>

```python
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset, Audio
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-0.6b")
model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-0.6b", dtype="auto", device_map=device)

# Resample the dataset audio to the rate the feature extractor expects
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:5]]

inputs = processor(speech_samples, sampling_rate=processor.feature_extractor.sampling_rate)
inputs.to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs)
print(processor.batch_decode(outputs))
```
</details>
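
Continuing from the snippet above, you can score the decoded text against the dataset references; a minimal sketch, assuming the third-party `jiwer` package is installed (`pip install jiwer`):

```python
import jiwer

# Normalize case, since the LibriSpeech references are uppercase
references = [t.lower() for t in ds["text"][:5]]
hypotheses = [t.strip().lower() for t in processor.batch_decode(outputs)]

# Average word error rate across the five samples
print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
```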

<details>
<summary>➡️ Training</summary>

```python
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset, Audio
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("nvidia/parakeet-ctc-0.6b")
model = AutoModelForCTC.from_pretrained("nvidia/parakeet-ctc-0.6b", dtype="auto", device_map=device)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
speech_samples = [el['array'] for el in ds["audio"][:5]]
text_samples = [el for el in ds["text"][:5]]

# passing `text` to the processor will prepare inputs' `labels` key
inputs = processor(audio=speech_samples, text=text_samples, sampling_rate=processor.feature_extractor.sampling_rate)
inputs.to(device, dtype=model.dtype)

outputs = model(**inputs)
outputs.loss.backward()
```
</details>
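
The training snippet stops at `backward()`; a complete update also needs an optimizer step. A minimal sketch reusing `model` and `inputs` from above, with plain PyTorch `AdamW` (the learning rate is illustrative, not a tuned value):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative LR

optimizer.zero_grad()
outputs = model(**inputs)  # `inputs` was prepared with labels above
outputs.loss.backward()
optimizer.step()
```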

## NVIDIA NeMo: Training

To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you've installed the latest PyTorch version.
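
A minimal sketch of that setup, assuming the ASR extras are all you need (check the NeMo repository for the currently recommended command):

```bash
# Install PyTorch first (pick the right build at https://pytorch.org),
# then NeMo with its ASR dependencies
pip install -U nemo_toolkit['asr']
```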