# NILC Portuguese Word Embeddings: FastText Skip-Gram 600d

Pretrained **static word embeddings** for **Portuguese** (Brazilian and European), trained by the [NILC group](http://nilc.icmc.usp.br/) on a large multi-genre corpus (~1.39B tokens, 17 sources).

This repository contains the **FastText Skip-Gram 600d** model in safetensors format.

---

## 📂 Files

- `embeddings.safetensors` → word vectors, a single matrix of shape `[vocab_size, 600]`
- `vocab.txt` → vocabulary, one token per line; line *i* corresponds to row *i* of the matrix
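
Because the two files are only useful when they line up, a quick consistency check before use can catch a truncated download or a mismatched pair. The sketch below is illustrative only: `alignment_problems` is a hypothetical helper, not part of this repository, and the default `dim=600` simply mirrors the shape stated above.

```python
import numpy as np


def alignment_problems(vectors, vocab, dim=600):
    """Return a list of human-readable problems; an empty list means the files line up.

    Hypothetical helper for illustration, not shipped with the model.
    """
    problems = []
    # The matrix should be 2-D with one row per token and `dim` columns.
    if vectors.ndim != 2 or vectors.shape[1] != dim:
        problems.append(f"expected shape [vocab_size, {dim}], got {vectors.shape}")
    # Row count must match the number of vocabulary lines.
    if vectors.shape[0] != len(vocab):
        problems.append(f"{vectors.shape[0]} rows but {len(vocab)} vocabulary entries")
    # Duplicate tokens would silently collapse in a token-to-index dict.
    if len(set(vocab)) != len(vocab):
        problems.append("vocabulary contains duplicate tokens")
    return problems
```

Called as `alignment_problems(vectors, vocab)` after loading both files, it returns `[]` when everything is consistent.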

---

## 🚀 Usage

```python
from safetensors.numpy import load_file

data = load_file("embeddings.safetensors")
vectors = data["embeddings"]  # np.ndarray, shape [vocab_size, 600]

# Read as UTF-8: Portuguese tokens contain accented characters.
with open("vocab.txt", encoding="utf-8") as f:
    vocab = [w.strip() for w in f]

word2idx = {w: i for i, w in enumerate(vocab)}
print(vectors[word2idx["rei"]])  # vector for "rei"
```

Or in PyTorch:

```python
from safetensors.torch import load_file

tensors = load_file("embeddings.safetensors")
vectors = tensors["embeddings"]  # torch.Tensor, shape [vocab_size, 600]
```
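
A common next step with static embeddings is a nearest-neighbor lookup by cosine similarity. The following is a minimal NumPy sketch under stated assumptions: `normalize_rows` and `nearest_neighbors` are hypothetical helper names introduced here, and the inputs are the `vectors`, `vocab`, and token-to-index mapping built as in the NumPy example above.

```python
import numpy as np


def normalize_rows(vectors):
    """Scale every row to unit L2 norm (all-zero rows are left at zero)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


def nearest_neighbors(word, unit_vectors, word2idx, vocab, k=5):
    """Return up to k (token, cosine similarity) pairs closest to `word`.

    `unit_vectors` must already be row-normalized, so a dot product
    equals the cosine similarity.
    """
    idx = word2idx[word]  # raises KeyError for tokens not in the vocabulary
    sims = unit_vectors @ unit_vectors[idx]
    sims[idx] = -np.inf  # exclude the query token itself
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]
```

Normalizing once with `unit = normalize_rows(vectors)` and reusing `unit` across queries avoids recomputing row norms on every call, for example `nearest_neighbors("rei", unit, word2idx, vocab)`. A token missing from `vocab.txt` raises `KeyError`, so callers may want to check membership first.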

---

## 📖 Reference

```bibtex
@inproceedings{hartmann-etal-2017-portuguese,
  title     = {{P}ortuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
  author    = {Hartmann, Nathan and Fonseca, Erick and Shulby, Christopher and Treviso, Marcos and Silva, J{\'e}ssica and Alu{\'i}sio, Sandra},
  year      = 2017,
  month     = oct,
  booktitle = {Proceedings of the 11th {B}razilian Symposium in Information and Human Language Technology},
  publisher = {Sociedade Brasileira de Computa{\c{c}}{\~a}o},
  address   = {Uberl{\^a}ndia, Brazil},
  pages     = {122--131},
  url       = {https://aclanthology.org/W17-6615/},
  editor    = {Paetzold, Gustavo Henrique and Pinheiro, Vl{\'a}dia}
}
```

---

## 📜 License

Creative Commons Attribution 4.0 International (CC BY 4.0)

