metadata
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
XLM-R-BERTić
This model was produced by pre-training XLM-Roberta-large for 48k steps on South Slavic languages.
Benchmarking
Three tasks were chosen for model evaluation:
- Named Entity Recognition (NER)
- Sentiment regression
- COPA (Choice of Plausible Alternatives)
In all cases, the model was fine-tuned separately for each downstream task.
NER
Mean F1 scores were used to evaluate performance.
| system | dataset | F1 score |
|---|---|---|
| BERTić | hr500k | 0.925 |
| XLM-R-BERTić | hr500k | 0.927 |
| XLM-R-SloBERTić | hr500k | 0.923 |
| XLM-Roberta-Large | hr500k | 0.919 |
| crosloengual-bert | hr500k | 0.918 |
| XLM-Roberta-Base | hr500k | 0.903 |
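The card does not say how the mean is taken or how predictions are matched to gold entities. As a minimal sketch, assuming exact-match span-level F1 averaged over several fine-tuning runs, the computation could look like this. The function name `f1_from_spans` and all span data are hypothetical and only illustrate the arithmetic; they are not taken from the benchmark.

```python
from statistics import mean


def f1_from_spans(gold: set, predicted: set) -> float:
    """Exact-match F1 between two sets of (start, end, label) spans."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented gold annotation and three invented runs (assumption: one
# prediction set per fine-tuning run, scored against the same gold set).
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
run_predictions = [
    {(0, 2, "PER"), (5, 6, "LOC")},                  # misses one entity
    {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")},  # all correct
    {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")},  # one wrong label
]

run_f1 = [f1_from_spans(gold, predicted) for predicted in run_predictions]
mean_f1 = mean(run_f1)
print(f"per-run F1: {run_f1}, mean F1: {mean_f1:.3f}")
```

Averaging over runs smooths out the variance that comes from random initialisation of the task head, which matters when the systems in the table differ by only a few thousandths.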
Sentiment regression
The ParlaSent dataset was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. The procedure is explained in greater detail in the dedicated benchmarking repository.
| system | train | test | R² |
|---|---|---|---|
| xlm-r-parlasent | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| XLM-R-SloBERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| XLM-Roberta-Large | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| XLM-R-BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| crosloengual-bert | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| XLM-Roberta-Base | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
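The negative score for the dummy baseline follows from the definition R² = 1 - SS_res / SS_tot. A constant predictor scores exactly 0 only when it predicts the mean of the test labels, and anything else scores below 0. The sketch below assumes the dummy predicts the mean of the training labels, which the card does not state explicitly. The helper `r2_score` and the toy label values are invented for illustration and are not ParlaSent data.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Toy labels (invented): the train and test means differ.
train_labels = [1.0, 2.0, 3.0, 4.0]
test_labels = [2.0, 3.0, 4.0, 5.0]

# Assumed dummy baseline: always predict the training mean.
dummy_prediction = sum(train_labels) / len(train_labels)
dummy_r2 = r2_score(test_labels, [dummy_prediction] * len(test_labels))
print(f"dummy R^2 on the toy test set: {dummy_r2:.2f}")
```

Because the training mean differs from the test mean, the residual sum of squares exceeds the total sum of squares and R² drops below zero, the same effect seen in the last row of the table.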
COPA
(to be added soon)
Citation
(to be added soon)