---
language:
  - he
tags:
  - language model
license: apache-2.0
datasets:
  - oscar
  - wikipedia
  - twitter
---

# AlephBERT

## Hebrew Language Model

A state-of-the-art language model for Hebrew, based on Google's BERT architecture (Devlin et al., 2018).

## How to use

```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# if not fine-tuning, disable dropout
alephbert.eval()
```
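
As a quick check that the model loads correctly, the sketch below encodes a single Hebrew sentence and extracts contextual embeddings from the last hidden layer. The example sentence is arbitrary, and the embedding size of 768 is the standard BERT-base hidden size.

```python
import torch

# Encode an arbitrary Hebrew sentence (illustrative only).
text = 'שלום עולם'
inputs = alephbert_tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = alephbert(**inputs)

# Token-level contextual embeddings: shape (1, num_tokens, 768).
token_embeddings = outputs.last_hidden_state
# A common single-vector sentence representation is the [CLS] token embedding.
sentence_embedding = token_embeddings[:, 0, :]
```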

## Training data

  1. OSCAR (Ortiz, 2019), Hebrew section (10 GB of text, 20M sentences); a loading sketch follows this list.
  2. Hebrew dump of Wikipedia (650 MB of text, 3.8M sentences).
  3. Hebrew Tweets collected from the Twitter sample stream (7 GB of text, 70M sentences).
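
For reference, the Hebrew portion of OSCAR can be pulled through the Hugging Face `datasets` library. The configuration name below is an assumption based on the public OSCAR release; this only illustrates fetching the raw corpus, not the exact preprocessing used for AlephBERT.

```python
from datasets import load_dataset

# Assumed configuration name for the deduplicated Hebrew portion of OSCAR;
# illustrative only, not the original preprocessing pipeline.
oscar_he = load_dataset('oscar', 'unshuffled_deduplicated_he', split='train')
print(oscar_he[0]['text'][:100])
```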

## Training procedure

The model was trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.

To optimize training time, we split the data into 4 sections based on the maximum number of tokens per sentence (a minimal bucketing sketch follows the list):

  1. num tokens < 32 (70M sentences)
  2. 32 <= num tokens < 64 (12M sentences)
  3. 64 <= num tokens < 128 (10M sentences)
  4. 128 <= num tokens < 512 (70M sentences)
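
A minimal sketch of this length-based bucketing, reusing the tokenizer loaded above; the placeholder sentences and helper function are illustrative and not part of the original pipeline.

```python
# Illustrative only: assign each sentence to one of the four length sections
# so each section can be trained with a max_length close to its longest sentence.
corpus = ['שלום עולם', 'זוהי דוגמה בלבד']  # placeholder sentences

def assign_bucket(num_tokens):
    for i, upper in enumerate((32, 64, 128, 512)):
        if num_tokens < upper:
            return i
    return 3  # anything longer falls into the last section and is truncated to 512

buckets = {i: [] for i in range(4)}
for sentence in corpus:
    buckets[assign_bucket(len(alephbert_tokenizer.tokenize(sentence)))].append(sentence)
```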

Each section was trained for 5 epochs with an initial learning rate of 1e-4.
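
In terms of the standard Hugging Face training loop, a masked-language-modeling run for one section might look roughly like the sketch below. The batch size, masking probability, output directory, and the tiny placeholder dataset are assumptions, not documented values from the original run.

```python
from datasets import Dataset
from transformers import (BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder dataset standing in for one tokenized length section.
texts = ['שלום עולם']
encodings = alephbert_tokenizer(texts, truncation=True, max_length=32)
train_dataset = Dataset.from_dict(dict(encodings))

# Fresh BERT-base model over the AlephBERT vocabulary (default config values assumed).
config = BertConfig(vocab_size=alephbert_tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Standard MLM collator; the 15% masking probability is the BERT default, assumed here.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=alephbert_tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir='alephbert-section-1',  # hypothetical output path
    num_train_epochs=5,                # 5 epochs per section
    learning_rate=1e-4,                # initial learning rate
    per_device_train_batch_size=32,    # assumed; not reported in the card
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```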

Total training time was 5 days.