# NILC Portuguese Word Embeddings — FastText Skip-Gram 600d
Pretrained **static word embeddings** for **Portuguese** (Brazilian + European), trained by the [NILC group](http://nilc.icmc.usp.br/) on a large multi-genre corpus (~1.39B tokens, 17 sources).
This repository contains the **FastText Skip-Gram 600d** model in safetensors format.
---
## 📂 Files
- `embeddings.safetensors` → word vectors (`[vocab_size, 600]`)
- `vocab.txt` → vocabulary (one token per line, aligned with rows)
---
## 🚀 Usage
```python
from safetensors.numpy import load_file
data = load_file("embeddings.safetensors")
vectors = data["embeddings"]
with open("vocab.txt") as f:
    vocab = [w.strip() for w in f]
word2idx = {w: i for i, w in enumerate(vocab)}
print(vectors[word2idx["rei"]]) # vector for "rei"
```
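As a quick sanity check, nearest neighbours can be computed with cosine similarity over the loaded matrix. This is a minimal sketch with NumPy; `most_similar` is just an illustrative helper, and the query word is assumed to be in the vocabulary:

```python
import numpy as np

def most_similar(word, k=5):
    # L2-normalise the matrix so dot products equal cosine similarities
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed @ normed[word2idx[word]]
    best = np.argsort(-scores)[1 : k + 1]  # drop the query word itself
    return [(vocab[i], float(scores[i])) for i in best]

print(most_similar("rei"))
```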
Or in PyTorch:
```python
from safetensors.torch import load_file
tensors = load_file("embeddings.safetensors")
vectors = tensors["embeddings"] # torch.Tensor
```
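To use the vectors as an embedding layer in a model, they can be wrapped with `nn.Embedding.from_pretrained` (a minimal sketch; `freeze=True` keeps the pretrained weights fixed, and `word2idx` is the mapping built above):

```python
import torch

# Wrap the pretrained matrix in a frozen embedding layer
embedding = torch.nn.Embedding.from_pretrained(vectors, freeze=True)
ids = torch.tensor([word2idx["rei"]])
print(embedding(ids).shape)  # torch.Size([1, 600])
```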
---
## 📖 Reference
```bibtex
@inproceedings{hartmann-etal-2017-portuguese,
title = {{P}ortuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
author = {Hartmann, Nathan and Fonseca, Erick and Shulby, Christopher and Treviso, Marcos and Silva, J{\'e}ssica and Alu{\'i}sio, Sandra},
year = 2017,
month = oct,
booktitle = {Proceedings of the 11th {B}razilian Symposium in Information and Human Language Technology},
publisher = {Sociedade Brasileira de Computa{\c{c}}{\~a}o},
address = {Uberl{\^a}ndia, Brazil},
pages = {122--131},
url = {https://aclanthology.org/W17-6615/},
editor = {Paetzold, Gustavo Henrique and Pinheiro, Vl{\'a}dia}
}
```
---
## 📜 License
Creative Commons Attribution 4.0 International (CC BY 4.0)