Committed by mtreviso
Commit 6537858 · verified · 1 parent: 93564d3

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +59 -0
  2. embeddings.safetensors +3 -0
  3. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,59 @@
+ # NILC Portuguese Word Embeddings: FastText Skip-Gram 600d
+
+ Pretrained **static word embeddings** for **Portuguese** (Brazilian and European), trained by the [NILC group](http://nilc.icmc.usp.br/) on a large multi-genre corpus (~1.39B tokens, 17 sources).
+
+ This repository contains the **FastText Skip-Gram 600d** model in safetensors format.
+
+ ---
+
+ ## 📂 Files
+ - `embeddings.safetensors` → word vectors (`[vocab_size, 600]`)
+ - `vocab.txt` → vocabulary (one token per line, aligned with rows)
+
+ ---
+
+ ## 🚀 Usage
+
+ ```python
+ from safetensors.numpy import load_file
+
+ data = load_file("embeddings.safetensors")  # dict of NumPy arrays keyed by tensor name
+ vectors = data["embeddings"]
+
+ with open("vocab.txt", encoding="utf-8") as f:  # line i matches row i of `vectors`
+     vocab = [w.strip() for w in f]
+
+ word2idx = {w: i for i, w in enumerate(vocab)}
+ print(vectors[word2idx["rei"]])  # vector for "rei"
+ ```
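Building on the lookup above, a nearest-neighbour query is a common next step with static embeddings. The following is a minimal sketch rather than anything shipped in this commit: `nearest_neighbors` is a hypothetical helper, and the toy vectors are made-up stand-ins so the snippet runs without the 2.2 GB file.

```python
import numpy as np


def nearest_neighbors(word, vectors, word2idx, vocab, k=5):
    """Return the k tokens most cosine-similar to `word` (hypothetical helper)."""
    if word not in word2idx:
        raise KeyError(f"{word!r} is not in the vocabulary")
    query = vectors[word2idx[word]]
    # Cosine similarity: dot product divided by the product of the norms.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = (vectors @ query) / np.maximum(norms, 1e-12)
    sims[word2idx[word]] = -np.inf  # exclude the query word itself
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]


# Toy data for illustration only; replace with the loaded `vectors` and `vocab`.
toy_vocab = ["rei", "rainha", "carro"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype=np.float32)
toy_idx = {w: i for i, w in enumerate(toy_vocab)}
print(nearest_neighbors("rei", toy_vectors, toy_idx, toy_vocab, k=2))
```

With the real matrix, the same call would take `vectors`, `word2idx`, and `vocab` from the snippet above. Recomputing the row norms on every query is wasteful for a large vocabulary, so caching them once is a reasonable refinement.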
+
+ Or in PyTorch:
+
+ ```python
+ from safetensors.torch import load_file
+ tensors = load_file("embeddings.safetensors")
+ vectors = tensors["embeddings"]  # torch.Tensor
+ ```
+
+ ---
+
+ ## 📖 Reference
+ ```bibtex
+ @inproceedings{hartmann-etal-2017-portuguese,
+     title = {{P}ortuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
+     author = {Hartmann, Nathan and Fonseca, Erick and Shulby, Christopher and Treviso, Marcos and Silva, J{\'e}ssica and Alu{\'i}sio, Sandra},
+     year = 2017,
+     month = oct,
+     booktitle = {Proceedings of the 11th {B}razilian Symposium in Information and Human Language Technology},
+     publisher = {Sociedade Brasileira de Computa{\c{c}}{\~a}o},
+     address = {Uberl{\^a}ndia, Brazil},
+     pages = {122--131},
+     url = {https://aclanthology.org/W17-6615/},
+     editor = {Paetzold, Gustavo Henrique and Pinheiro, Vl{\'a}dia}
+ }
+ ```
+
+ ---
+
+ ## 📜 License
+ Creative Commons Attribution 4.0 International (CC BY 4.0)
embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b1a94874df702289ba7a28cb9b8dc8e7c4d7da9c8032993ad0bb8a07ceb0efa
+ size 2231052096
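The pointer above records only the object hash and byte size. As a rough check, 2,231,052,096 bytes is consistent with about 929,605 rows of 600 float32 values (2,400 bytes per row) plus a 96-byte header, but the dtype is an assumption that the source does not state. One way to confirm shape and dtype without loading the tensor is to read the JSON header that the safetensors format places at the start of the file. The sketch below uses only the standard library; `read_safetensors_header` is a hypothetical helper name.

```python
import json
import os
import struct


def read_safetensors_header(path):
    """Return the tensor entries from a safetensors header (hypothetical helper)."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


# Only runs if the downloaded file is present in the working directory.
if os.path.exists("embeddings.safetensors"):
    for name, info in read_safetensors_header("embeddings.safetensors").items():
        print(name, info["dtype"], info["shape"])
```

Each entry carries `dtype`, `shape`, and `data_offsets`, so the first element of the `embeddings` shape can be compared against the number of lines in `vocab.txt`.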
vocab.txt ADDED
The diff for this file is too large to render.