metadata
language: pt
tags:
  - word-embeddings
  - static
  - portuguese
  - fasttext
  - skip-gram
  - 300d
license: cc-by-4.0
library_name: safetensors
pipeline_tag: feature-extraction

NILC Portuguese Word Embeddings — FastText Skip-Gram 300d

This repository contains the FastText Skip-Gram 300d model in safetensors format.

About

NILC-Embeddings is a repository for storing and sharing word embeddings for the Portuguese language. The goal is to provide ready-to-use vector resources for Natural Language Processing (NLP) and Machine Learning tasks.

The embeddings were trained on a large Portuguese corpus (Brazilian + European), composed of 17 corpora (~1.39B tokens). Training was carried out with the following algorithms: Word2Vec, FastText, Wang2Vec, and GloVe.


📂 Files

  • embeddings.safetensors → embedding matrix ([vocab_size, 300])
  • vocab.txt → vocabulary (one token per line, aligned with rows)
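
Because line i of vocab.txt corresponds to row i of the embedding matrix, a plain dictionary is enough to look up a word's vector. The sketch below uses a tiny made-up vocabulary and 2-dimensional vectors in place of the real files. The helper name `lookup` and the toy values are illustrative assumptions, not part of this repository. It also assumes the matrix stores whole-word vectors only, so out-of-vocabulary tokens return None.

```python
import numpy as np

# Toy stand-ins for vocab.txt and embeddings.safetensors (assumption: the
# real files follow the same "line i <-> row i" alignment described above).
vocab = ["casa", "gato", "cachorro"]
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.9]], dtype=np.float32)

# Map each token to the row that holds its vector.
token_to_row = {token: row for row, token in enumerate(vocab)}


def lookup(token):
    """Return the vector for `token`, or None if it is not in the vocabulary."""
    row = token_to_row.get(token)
    return None if row is None else vectors[row]


print(lookup("gato"))      # vector stored in row 1
print(lookup("inexiste"))  # None: token not in the toy vocabulary
```

With the real files, `vocab` and `vectors` would come from the loading code in the Usage section instead of the toy values.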

🚀 Usage

from huggingface_hub import hf_hub_download
from safetensors.numpy import load_file

path = hf_hub_download(repo_id="nilc-nlp/fasttext-skip-gram-300d",
                       filename="embeddings.safetensors")

data = load_file(path)
vectors = data["embeddings"]

vocab_path = hf_hub_download(repo_id="nilc-nlp/fasttext-skip-gram-300d",
                             filename="vocab.txt")
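# Read as UTF-8 explicitly: Portuguese tokens contain accented characters.
with open(vocab_path, encoding="utf-8") as f:
    vocab = [w.strip() for w in f]

print(vectors.shape)  # (vocab_size, 300)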

Or in PyTorch:

from safetensors.torch import load_file
tensors = load_file(path)  # `path` as returned by hf_hub_download above
vectors = tensors["embeddings"]  # torch.Tensor
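
Once the matrix and vocabulary are loaded, a common next step is to find a word's nearest neighbours by cosine similarity. The following is a minimal sketch under stated assumptions: `most_similar` is a hypothetical helper (not provided by this repository), and the toy vocabulary and 2-dimensional vectors stand in for the real `vocab` and `vectors`.

```python
import numpy as np


def most_similar(query, vocab, vectors, topn=3):
    """Rank vocabulary tokens by cosine similarity to `query` (hypothetical helper)."""
    token_to_row = {token: row for row, token in enumerate(vocab)}
    if query not in token_to_row:
        raise KeyError(f"{query!r} is not in the vocabulary")
    # L2-normalise every row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    scores = unit @ unit[token_to_row[query]]
    # Sort by descending similarity and drop the query word itself.
    order = np.argsort(-scores)
    ranked = [(vocab[i], float(scores[i])) for i in order if vocab[i] != query]
    return ranked[:topn]


# Toy data for illustration only; the values are made up.
toy_vocab = ["rei", "rainha", "banana"]
toy_vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]], dtype=np.float32)

neighbours = most_similar("rei", toy_vocab, toy_vectors, topn=2)
print(neighbours)
```

For the full matrix it is usually cheaper to normalise the rows once and reuse the result across queries, rather than recomputing the norms on every call as this sketch does.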

📊 Corpus

The embeddings were trained on a combination of 17 corpora (~1.39B tokens):

| Corpus | Tokens | Types | Genre | Description |
|---|---:|---:|---|---|
| LX-Corpus [Rodrigues et al. 2016] | 714,286,638 | 2,605,393 | Mixed genres | Large collection of texts from 19 sources, mostly European Portuguese |
| Wikipedia | 219,293,003 | 1,758,191 | Encyclopedic | Wikipedia dump (2016-10-20) |
| GoogleNews | 160,396,456 | 664,320 | Informative | News crawled from Google News |
| SubIMDB-PT | 129,975,149 | 500,302 | Spoken | Movie subtitles from IMDb |
| G1 | 105,341,070 | 392,635 | Informative | News from G1 portal (2014–2015) |
| PLN-Br [Bruckschen et al. 2008] | 31,196,395 | 259,762 | Informative | Corpus of PLN-BR project (1994–2005) |
| Domínio Público | 23,750,521 | 381,697 | Prose | 138,268 literary works |
| Lacio-Web [Aluísio et al. 2003] | 8,962,718 | 196,077 | Mixed | Literary, informative, scientific, law, didactic texts |
| Literatura Brasileira | 1,299,008 | 66,706 | Prose | Classical Brazilian fiction e-books |
| Mundo Estranho | 1,047,108 | 55,000 | Informative | Texts from Mundo Estranho magazine |
| CHC | 941,032 | 36,522 | Informative | Texts from Ciência Hoje das Crianças |
| FAPESP | 499,008 | 31,746 | Science communication | Texts from Pesquisa FAPESP magazine |
| Textbooks | 96,209 | 11,597 | Didactic | Elementary school textbooks |
| Folhinha | 73,575 | 9,207 | Informative | Children’s news from Folhinha (Folha de São Paulo) |
| NILC subcorpus | 32,868 | 4,064 | Informative | Children’s texts (3rd–4th grade) |
| Para Seu Filho Ler | 21,224 | 3,942 | Informative | Children’s news from Zero Hora |
| SARESP | 13,308 | 3,293 | Didactic | School evaluation texts |
| Total | 1,395,926,282 | 3,827,725 | | |

📖 Paper

Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks
Hartmann, N. et al. (2017), STIL 2017.
ArXiv Paper

BibTeX

@inproceedings{hartmann-etal-2017-portuguese,
  title        = {{P}ortuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
  author       = {Hartmann, Nathan  and Fonseca, Erick  and Shulby, Christopher  and Treviso, Marcos  and Silva, J{\'e}ssica  and Alu{\'i}sio, Sandra},
  year         = 2017,
  month        = oct,
  booktitle    = {Proceedings of the 11th {B}razilian Symposium in Information and Human Language Technology},
  publisher    = {Sociedade Brasileira de Computa{\c{c}}{\~a}o},
  address      = {Uberl{\^a}ndia, Brazil},
  pages        = {122--131},
  url          = {https://aclanthology.org/W17-6615/},
  editor       = {Paetzold, Gustavo Henrique  and Pinheiro, Vl{\'a}dia}
}

📜 License

Creative Commons Attribution 4.0 International (CC BY 4.0)