# XLM-R-BERTić
This model was produced by pre-training XLM-RoBERTa-large for 48k steps on South Slavic languages using the XLM-R-BERTić dataset.
## Benchmarking
Three tasks were chosen for model evaluation:
- Named Entity Recognition (NER)
- Sentiment regression
- COPA (Choice of Plausible Alternatives)

In all cases, the model was fine-tuned for the specific downstream task.
### NER
Performance is reported as the macro-F1 score averaged over three runs. Datasets used: hr500k, ReLDI-sr, ReLDI-hr, and SETimes.SR.
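As a rough illustration of this protocol (not the benchmark code itself), the sketch below computes macro-F1 in pure Python and averages it over three runs; the label names and predictions are invented for the example.

```python
# Illustrative sketch: macro-F1 per run, averaged over three runs.
# Labels and predictions below are invented, not from the actual benchmark.

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores (macro-F1)."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Three fine-tuning runs differ only in random seed, so the reported
# number is the mean of the three per-run macro-F1 scores.
gold = ["B-PER", "O", "B-LOC", "O"]
run_scores = [
    macro_f1(gold, ["B-PER", "O", "B-LOC", "O"]),      # run 1
    macro_f1(gold, ["B-PER", "O", "O", "O"]),          # run 2
    macro_f1(gold, ["B-PER", "B-LOC", "B-LOC", "O"]),  # run 3
]
average = sum(run_scores) / len(run_scores)
```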
### Sentiment regression
The ParlaSent dataset was used to evaluate sentiment regression for the Bosnian, Croatian, and Serbian languages.
The procedure is explained in greater detail in the dedicated benchmarking repository.
| system | train | test | r^2 |
|---|---|---|---|
| xlm-r-parlasent | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| XLM-R-SloBERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| XLM-Roberta-Large | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| XLM-R-BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| crosloengual-bert | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| XLM-Roberta-Base | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
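The r^2 values above are coefficients of determination. As a minimal pure-Python sketch (not the benchmark code; the data values are illustrative), the definition also explains why the dummy mean baseline can score below zero:

```python
# Minimal sketch of the r^2 metric; values below are illustrative only.

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# A predictor that always outputs the *training* mean scores r^2 = 0 only
# when the test mean matches the training mean; otherwise r^2 goes
# negative, which is how a "dummy (mean)" baseline can land below zero.
y_test = [1.0, 2.0, 3.0, 4.0]          # illustrative test labels
perfect = r_squared(y_test, y_test)     # exact predictions give 1.0
train_mean = 3.0                        # illustrative training-set mean
baseline = r_squared(y_test, [train_mean] * len(y_test))  # negative here
```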
### COPA
## Citation
Please cite the following paper:
```bibtex
@inproceedings{ljubesic-etal-2024-language,
    title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
    author = "Ljube{\v{s}}i{\'c}, Nikola and
      Suchomel, V{\'\i}t and
      Rupnik, Peter and
      Kuzman, Taja and
      van Noord, Rik",
    editor = "Melero, Maite and
      Sakti, Sakriani and
      Soria, Claudia",
    booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.sigul-1.23",
    pages = "189--203",
}
```