---
tags:
- generated_from_trainer
datasets:
- Graphcore/wikipedia-bert-128
- Graphcore/wikipedia-bert-512
model-index:
- name: Graphcore/bert-base-uncased
  results: []
---

# Graphcore/bert-base-uncased

This model is a pre-trained BERT-Base trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.

## Model description

Pre-trained BERT Base model trained on Wikipedia data.

## Intended uses & limitations

More information needed.

## Training and evaluation data

Trained on the following Wikipedia datasets:
- [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
- [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)

## Training procedure

Trained with the MLM and NSP pre-training objectives, using the LAMB optimizer from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).

Trained on 16 Graphcore Mk2 IPUs.

Command lines:

Phase 1:
```
python examples/language-modeling/run_pretraining.py \
  --config_name bert-base-uncased \
  --tokenizer_name bert-base-uncased \
  --do_train \
  --logging_steps 5 \
  --max_seq_length 128 \
  --ipu_config_name Graphcore/bert-base-ipu \
  --dataset_name Graphcore/wikipedia-bert-128 \
  --max_steps 10500 \
  --is_already_preprocessed \
  --dataloader_num_workers 64 \
  --dataloader_mode async_rebatched \
  --lamb \
  --lamb_no_bias_correction \
  --per_device_train_batch_size 32 \
  --gradient_accumulation_steps 512 \
  --learning_rate 0.006 \
  --lr_scheduler_type linear \
  --loss_scaling 16384 \
  --weight_decay 0.01 \
  --warmup_ratio 0.28 \
  --save_steps 100 \
  --config_overrides "layer_norm_eps=0.001" \
  --ipu_config_overrides "device_iterations=1" \
  --output_dir output-pretrain-bert-base-phase1
```

Phase 2:
```
python examples/language-modeling/run_pretraining.py \
  --config_name bert-base-uncased \
  --tokenizer_name bert-base-uncased \
  --model_name_or_path ./output-pretrain-bert-base-phase1 \
  --do_train \
  --logging_steps 5 \
  --max_seq_length 512 \
  --ipu_config_name Graphcore/bert-base-ipu \
  --dataset_name Graphcore/wikipedia-bert-512 \
  --max_steps 2038 \
  --is_already_preprocessed \
  --dataloader_num_workers 128 \
  --dataloader_mode async_rebatched \
  --lamb \
  --lamb_no_bias_correction \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 512 \
  --learning_rate 0.002828 \
  --lr_scheduler_type linear \
  --loss_scaling 128.0 \
  --weight_decay 0.01 \
  --warmup_ratio 0.128 \
  --config_overrides "layer_norm_eps=0.001" \
  --ipu_config_overrides "device_iterations=1,embedding_serialization_factor=2,matmul_proportion=0.22" \
  --output_dir output-pretrain-bert-base-phase2
```

### Training hyperparameters

The following hyperparameters were used during phase 1 training:
- learning_rate: 0.006
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 65536
- total_eval_batch_size: 128
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.28
- training_steps: 10500
- training precision: Mixed Precision

The following hyperparameters were used during phase 2 training:
- learning_rate: 0.002828
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 16384
- total_eval_batch_size: 128
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.128
- training_steps: 2038
- training precision: Mixed Precision
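The total train batch sizes above are the product of the per-device batch size, the gradient accumulation steps, and the data-parallel replication factor. A minimal sketch of that arithmetic, assuming a replication factor of 4 (consistent with the 16 IPUs and the listed totals, but not stated explicitly in this card):

```python
# Effective (total) train batch size per phase.
# replication_factor = 4 is an assumption inferred from the listed totals.
gradient_accumulation_steps = 512
replication_factor = 4

per_device_train_batch_size = {"phase1": 32, "phase2": 8}

for phase, bs in per_device_train_batch_size.items():
    total = bs * gradient_accumulation_steps * replication_factor
    print(phase, total)  # phase1 -> 65536, phase2 -> 16384
```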
### Framework versions

- Transformers 4.17.0.dev0
- Pytorch 1.10.0+cpu
- Datasets 1.18.3.dev0
- Tokenizers 0.10.3
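## How to use

The pre-trained weights can be loaded with the standard `transformers` API. A minimal masked-language-modelling sketch, assuming the uploaded checkpoint is compatible with `AutoModelForMaskedLM` (the NSP head from pre-training is simply discarded):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: the checkpoint loads with the standard Hugging Face BERT classes.
tokenizer = AutoTokenizer.from_pretrained("Graphcore/bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("Graphcore/bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
logits = model(**inputs).logits

# Position of the [MASK] token and its most likely replacement
mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))
```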