---
language: hr
datasets:
- parlaspeech-hr
tags:
- audio
- automatic-speech-recognition
- parlaspeech
widget:
- example_title: example 1
  src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/1800.m4a
- example_title: example 2
  src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/00020578b.flac.wav
- example_title: example 3
  src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr-lm/raw/main/00020570a.flac.wav
---

# wav2vec2-xls-r-parlaspeech-hr-lm

This Croatian ASR model is based on the 
[facebook/wav2vec2-xls-r-300m model](https://huggingface.co/facebook/wav2vec2-xls-r-300m) 
and was fine-tuned on 300 hours of recordings and transcripts from the 
Croatian parliament ASR dataset [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494).

<div style="border: 5px solid #ff6700;  padding: 10px; margin: 10px 0;">
  <strong>Notice:</strong> The ParlaSpeech corpora are currently being enriched with new features. Follow our progress here: <a href="http://clarinsi.github.io/parlaspeech">http://clarinsi.github.io/parlaspeech</a>
</div>

If you use this model, please cite the following paper:
```
@inproceedings{ljubevsic2022parlaspeech,
  title={ParlaSpeech-HR-a Freely Available ASR Dataset for Croatian Bootstrapped from the ParlaMint Corpus},
  author={Ljube{\v{s}}i{\'c}, Nikola and Kor{\v{z}}inek, Danijel and Rupnik, Peter and Jazbec, Ivo-Pavao},
  booktitle={Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference},
  pages={111--116},
  year={2022},
  url={http://www.lrec-conf.org/proceedings/lrec2022/workshops/ParlaCLARINIII/pdf/2022.parlaclariniii-1.16.pdf}
}
```



## Metrics

Evaluation is performed on the dev and test portions of the [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494) dataset.

|split|CER|WER|
|---|---|---|
|dev|0.0448|0.1129|
|test|0.0363|0.0985|

Several models are available; in terms of CER and WER, the best-performing one 
is [wav2vec2-large-slavic-parlaspeech-hr-lm](https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr-lm).
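
For readers unfamiliar with the two metrics, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, at character and word level respectively. The helper names (`edit_distance`, `wer`, `cer`) and the reference sentence are illustrative assumptions; the exact text normalization used to produce the numbers in this card is not specified here, so this sketch is not guaranteed to reproduce them.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical reference transcript, invented for illustration only.
reference = "velik broj poslovnih subjekata posluje s minusom"
hypothesis = "velik broj poslovnih subjekata posluje sa minusom"

print(f"WER: {wer(reference, hypothesis):.4f}")
print(f"CER: {cer(reference, hypothesis):.4f}")
```

In this toy pair a single word differs by one inserted character, so the WER is one error over seven reference words, while the CER is one error over 48 reference characters.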

## Usage in `transformers`

Tested with `transformers==4.18.0`, `torch==1.11.0`, and `SoundFile==0.10.3.post1`.


```python
from transformers import Wav2Vec2ProcessorWithLM, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# load the processor (feature extractor, tokenizer, and LM decoder) and the model
processor = Wav2Vec2ProcessorWithLM.from_pretrained(
    "classla/wav2vec2-xls-r-parlaspeech-hr-lm")
model = Wav2Vec2ForCTC.from_pretrained(
    "classla/wav2vec2-xls-r-parlaspeech-hr-lm").to(device)
# download the example wav file:
os.system("wget https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr/raw/main/00020570a.flac.wav")
# read the wav file
speech, sample_rate = sf.read("00020570a.flac.wav")
# extract input features and move them to the same device as the model
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits
# the LM decoder works on NumPy arrays, so move the logits to the CPU first
transcription = processor.batch_decode(logits.cpu().numpy()).text[0]

# remove the raw wav file
os.system("rm 00020570a.flac.wav")
transcription

# transcription: 'velik broj poslovnih subjekata posluje sa minusom velik dio'
```



## Training hyperparameters

In fine-tuning, the following arguments were used:

| arg                           | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 16    |
| `gradient_accumulation_steps` | 4     |
| `num_train_epochs`            | 8     |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
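
As a rough illustration of what these values imply, the sketch below collects the arguments from the table above in a plain dictionary and derives the effective batch size per optimizer step. The `effective_batch_size` helper and its `num_devices` parameter are hypothetical; the number of devices used for fine-tuning is not stated in this card, so the default of one device is an assumption.

```python
# Fine-tuning arguments exactly as listed in the table above.
finetuning_args = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 8,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}


def effective_batch_size(args, num_devices=1):
    """Number of examples contributing to one optimizer step.

    num_devices is an assumption: the card does not say how many
    devices were used during fine-tuning.
    """
    return (args["per_device_train_batch_size"]
            * args["gradient_accumulation_steps"]
            * num_devices)


print(effective_batch_size(finetuning_args))  # 64 on a single device
```

The dictionary keys match parameter names of `transformers.TrainingArguments`, so they could be passed along as keyword arguments together with an `output_dir` of your choice.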