---
language:
- en
tags:
- babylm-baseline
- strict
- babylm-2025
---
# Model Card for the Interaction GPT-2 Baseline
<!-- Provide a quick summary of what the model is/does. [Optional] -->
A 124M-parameter model with the GPT-2 architecture, trained with the next-token prediction loss for 10 epochs (~900M words) **on 90% of the BabyLM corpus**, as a naive autoregressive baseline for the Interaction track of the 2025 BabyLM challenge.
This model card is based on the model card of the BabyLM [100M GPT-2 baseline](https://huggingface.co/BabyLM-community/babylm-baseline-100m-gpt2).
# Table of Contents
- [Model Card for the Interaction GPT-2 Baseline](#model-card-for-the-interaction-gpt-2-baseline)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Hyperparameters](#hyperparameters)
- [Training Procedure](#training-procedure)
- [Size and Checkpoints](#size-and-checkpoints)
- [Evaluation](#evaluation)
- [Testing Data & Metrics](#testing-data--metrics)
- [Testing Data](#testing-data)
- [Metrics](#metrics)
- [Fine-tuning Hyperparameters](#fine-tuning-hyperparameters)
- [Results](#results)
- [Technical Specifications](#technical-specifications)
- [Hardware](#hardware)
- [Software](#software)
- [Training Time](#training-time)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors)
- [Bibliography](#bibliography)
# Model Details
## Model Description
<!-- Provide a longer summary of what this model is/does. -->
This is the pretrained GPT-2 model that serves as the basis for PPO fine-tuning in the Interaction track of the 2025 BabyLM challenge.
- **Developed by:** Jonas Mayer Martins, Ali Hamza Bashir, Muhammad Rehan Khalid
- **Model type:** Causal language model
- **Language(s) (NLP):** eng
- **Resources for more information:**
- [GitHub Repo](https://github.com/malihamza/babylm-interactive-learning)
# Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This is a pretrained language model.
It can be used to evaluate tasks in a zero-shot manner and can also be fine-tuned for downstream tasks.
It can be used for language generation, but given its small size and the small number of words it was trained on, do not expect LLM-level performance.
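For illustration, here is a minimal generation sketch with the `transformers` library; the repository id below is an assumption based on this model's name and may differ from the actual hub path.

```python
# Minimal usage sketch; the model id is a hypothetical hub path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BabyLM-community/blm-gpt2s-90M-s42"  # hypothetical; adjust to the real repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```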
# Training Details
## Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
We used the BabyLM 100M (Strict) dataset for training. **We trained the tokenizer and model on a randomly selected 90% of the corpus**, which is composed of the following:
| Source | Weight | Domain | Citation | Website | License |
| --- | --- | --- | --- | --- | --- |
| BNC | 8% | Dialogue | BNC Consortium (2007) | [link](http://www.natcorp.ox.ac.uk/) | [link](http://www.natcorp.ox.ac.uk/docs/licence.html) <sup>1</sup> |
| CHILDES | 29% | Dialogue, Child-Directed | MacWhinney (2000) | | [link](https://talkbank.org/share/rules.html) |
| Project Gutenberg | 26% | Fiction, Nonfiction | Gerlach & Font-Clos (2020) | [link](https://github.com/pgcorpus/gutenberg) | [link](https://www.gutenberg.org/policy/license.html) |
| OpenSubtitles | 20% | Dialogue, Scripted | Lison & Tiedemann (2016) | [link](https://opus.nlpl.eu/OpenSubtitles-v2018.php) | Open source |
| Simple English Wikipedia | 15% | Nonfiction | -- | [link](https://dumps.wikimedia.org/simplewiki/20221201/) | [link](https://dumps.wikimedia.org/legal.html) |
| Switchboard | 1% | Dialogue | Godfrey et al. (1992), Stolcke et al., (2000) | [link](http://compprag.christopherpotts.net/swda.html) | [link](http://compprag.christopherpotts.net/swda.html) |
<sup>1</sup> Our distribution of part of the BNC Texts is permitted under the fair dealings provision of copyright law (see term (2g) in the BNC license).
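As a rough illustration of the tokenizer setup, here is a sketch of training a 16k-vocabulary byte-level BPE tokenizer on the 90% split with the `tokenizers` library; the file path and special token are assumptions, not the exact training script.

```python
# Sketch of tokenizer training; the corpus path is hypothetical.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["babylm_90pct/train.txt"],   # hypothetical path to the 90% split
    vocab_size=16000,                   # matches the vocab size in the table below
    special_tokens=["<|endoftext|>"],   # assumed GPT-2-style end-of-text token
)
tokenizer.save_model("tokenizer_out")
```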
## Hyperparameters
| Hyperparameter | Value |
| --- | --- |
| Number of epochs | 10 |
| Datapoint length | 512 |
| Batch size | 16 |
| Gradient accumulation steps | 4 |
| Learning rate | 0.0005 |
| Number of steps | 211650 |
| Warmup steps | 2116 |
| Gradient clipping | 1 |
| Optimizer | AdamW |
| Optimizer Beta_1 | 0.9 |
| Optimizer Beta_2 | 0.999 |
| Optimizer Epsilon | 10<sup>-8</sup>|
| Tokenizer | Byte-level BPE |
| Vocab Size | 16000 |
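The table translates roughly into the following `transformers` configuration; this is a sketch under the stated hyperparameters, not the actual training script (the output path is a placeholder).

```python
# Sketch mapping the hyperparameter table onto TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="blm-gpt2s-90M",          # placeholder
    num_train_epochs=10,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=5e-4,
    warmup_steps=2116,
    max_grad_norm=1.0,                   # gradient clipping
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
# Datapoint length: sequences are chunked to 512 tokens during preprocessing.
```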
## Training Procedure
The model is trained with next token prediction loss for 10 epochs.
### Size and checkpoints
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
The model has 124M parameters.
In total we train on around 900M words and provide multiple checkpoints from the training.
Specifically, we provide (see the sketch after this list):
- Checkpoints every 1M words for the first 10M words
- Checkpoints every 10M words until 100M words
- Checkpoints every 100M words until the end of training (~900M words)
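A small helper that reproduces this schedule, assuming training ends at ~900M words:

```python
# Word counts at which checkpoints are saved, per the schedule above.
def checkpoint_schedule():
    m = 1_000_000
    points = [i * m for i in range(1, 11)]          # every 1M up to 10M
    points += [i * 10 * m for i in range(2, 11)]    # every 10M up to 100M
    points += [i * 100 * m for i in range(2, 10)]   # every 100M up to 900M
    return points

print(len(checkpoint_schedule()))  # 27 checkpoints
```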
# Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
This model is evaluated in two ways:
1. We do zero-shot evaluation on 7 tasks.
2. We do fine-tuning on a subset of the (Super)GLUE tasks (Wang et al., ICLR 2019; Wang et al., NeurIPS 2019).
## Testing Data & Metrics
### Testing Data
<!-- This should link to a Data Card if possible. -->
For the BLiMP, BLiMP supplement, and EWoK tasks, we use a filtered version of each dataset that only includes examples whose words all appear in the BabyLM dataset.
For the fine-tuning tasks, we both filter and subsample down to a maximum of 10,000 training examples.
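A hedged sketch of this filtering and subsampling with the `datasets` library; the vocabulary file, data file, and the single `text` column are assumptions (real GLUE tasks have task-specific fields).

```python
# Sketch of vocabulary filtering and subsampling; paths are hypothetical.
from datasets import load_dataset

train_vocab = set(open("babylm_vocab.txt").read().split())  # hypothetical vocab file

def all_words_known(example):
    return all(w in train_vocab for w in example["text"].lower().split())

ds = load_dataset("text", data_files="task_train.txt", split="train")  # placeholder data
ds = ds.filter(all_words_known)
ds = ds.shuffle(seed=42).select(range(min(10_000, len(ds))))
```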
*Validation Data*
*Zero-shot Tasks*
- **BLiMP**: The Benchmark of Linguistic Minimal Pairs evaluates the model's linguistic ability by testing whether it can recognize the grammatically correct sentence from a pair of minimally different sentences. It covers various grammatical phenomena. (Warstadt et al., TACL 2020)
- **BLiMP Supplement**: A supplement to BLiMP introduced in the first edition of the BabyLM challenge. It focuses more on dialogue and questions. (Warstadt et al., CoNLL-BabyLM 2023)
- **EWoK**: Works similarly to BLiMP but probes the model's internal world knowledge, testing both physical and social knowledge. (Ivanova et al., 2024)
- **Eye Tracking and Self-paced Reading**: Tests whether the model can mimic human eye-tracking and reading-time data, using the surprisal of a word as a proxy for the time spent reading it. (de Varda et al., BRM 2024)
- **Entity Tracking**: Checks whether a model can keep track of the changes to the states of entities as text/dialogue unfolds. (Kim & Schuster, ACL 2023)
- **WUGs**: Tests morphological generalization in LMs through an adjective nominalization and past tense task. (Hofmann et al., 2024) (Weissweiler et al., 2023)
- **COMPS**: Tests knowledge of the properties of concepts and their inheritance through minimal pair sentences. (Misra et al., 2023)
*Finetuning Tasks*
- **BoolQ**: A yes/no QA dataset with unprompted and unconstrained questions. (Clark et al., NAACL 2019)
- **MNLI**: The Multi-Genre Natural Language Inference corpus tests the language understanding of a model by seeing whether it can recognize textual entailment. (Williams et al., NAACL 2018)
- **MRPC**: The Microsoft Research Paraphrase Corpus contains pairs of sentences that are either paraphrases/semantically equivalent to each other or unrelated. (Dolan & Brockett, IJCNLP 2005)
- **QQP**<sup>2</sup>: Similarly to MRPC, the Quora Question Pairs corpus tests the model's ability to determine whether a pair of questions is semantically similar. The questions are sourced from Quora.
- **MultiRC**: The Multi-Sentence Reading Comprehension corpus is a QA task that evaluates the model's ability to choose the correct answer from a list of answers, given a question and a context paragraph. In this version the data is recast as binary classification, judging whether a given answer to a question-context pair is correct. (Khashabi et al., NAACL 2018)
- **RTE**: The Recognizing Textual Entailment corpus likewise tests the model's ability to recognize textual entailment. (Dagan et al., Springer 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., TAC 2009)
- **WSC**: The Winograd Schema Challenge tests the model's ability to perform coreference resolution on sentences with a pronoun and a list of noun phrases found in the sentence. This version recasts it as binary classification over examples consisting of a pronoun and a noun phrase. (Levesque et al., PKRR 2012)
<sup>2</sup> https://www.quora.com/profile/Ricky-Riche-2/First-Quora-Dataset-Release-Question-Pairs
### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
The metrics used to evaluate the model are the following:
- Zero-shot
- Accuracy on predicting the correct completion/sentence for BLiMP, BLiMP Supplement, EWoK, Entity Tracking, and WUGs
  - Change in regression R² over a baseline for Eye Tracking (no spillover) and Self-paced Reading (1-word spillover)
- Finetuning
- 3 class Accuracy for MNLI
- Binary Accuracy for BoolQ, MultiRC, and WSC
- F1-score for MRPC and QQP
The metrics were chosen following the recommendations of the papers introducing each task.
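For the minimal-pair tasks, accuracy amounts to checking whether the model assigns a higher log-likelihood to the acceptable sentence of each pair. A minimal sketch, using an off-the-shelf GPT-2 as a stand-in model:

```python
# Minimal-pair accuracy sketch: the model is "correct" when it assigns
# a higher total log-probability to the acceptable sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # mean NLL over predicted tokens
    return -loss.item() * (ids.shape[1] - 1)  # total log-probability

good, bad = "The cats sleep.", "The cats sleeps."
print(sentence_logprob(good) > sentence_logprob(bad))
```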
### Fine-tuning Hyperparameters
| Hyperparameter | MNLI, RTE, QQP, MRPC, BoolQ, MultiRC | WSC |
| --- | --- | --- |
| Learning Rate | 3\*10<sup>-5</sup> | 3\*10<sup>-5</sup> |
| Batch Size | 16 | 16 |
| Epochs | 10 | 30 |
| Weight decay | 0.01 | 0.01 |
| Optimizer | AdamW | AdamW |
| Scheduler | cosine | cosine |
| Warmup percentage | 6% | 6% |
| Dropout | 0.1 | 0.1 |
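As a sketch, the optimizer and scheduler from this table can be set up as follows; the model, number of labels, and step count are placeholders.

```python
# Sketch of the fine-tuning optimization setup: AdamW (weight decay 0.01)
# with a cosine schedule and 6% warmup. Placeholders are marked.
import torch
from transformers import AutoModelForSequenceClassification, get_cosine_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)  # placeholder
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)

num_training_steps = 1_000  # placeholder: epochs * steps per epoch
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.06 * num_training_steps),  # 6% warmup
    num_training_steps=num_training_steps,
)
```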
## Results
We compare our two models (900M-pre and 900M-RL) against two official baselines from the 2025 BabyLM Challenge:
- **1000M-pre:** The standard *pretraining* baseline, using a GPT-2-small model trained on 100M unique words from the BabyLM dataset (10 epochs, next-word prediction).
- **SimPO:** A baseline first trained for 7 epochs with next-word prediction, then 2 epochs *interleaving* prediction and reinforcement learning. Here, the RL reward encourages the student to generate completions similar to the teacher’s output.
- **900M-pre:** Our model, using the same GPT-2-small architecture, pretrained on 90% of the BabyLM dataset (yielding approximately 91M unique words, 10 epochs).
- **900M-RL:** Our model after additional PPO-based reinforcement learning with the teacher, using about 1M words as input for the interactive (RL) phase (see the simplified sketch after this list).
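To make the interactive phase concrete, here is a heavily simplified policy-gradient sketch; the actual setup used PPO (which adds clipping, a value head, and a KL penalty), and the teacher reward below is a hypothetical placeholder.

```python
# Simplified REINFORCE-style sketch of the RL phase; real training used PPO.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")             # stand-in student
student = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def teacher_reward(text):
    # Placeholder: the real reward scores similarity to a teacher model's output.
    return float(len(text.split()) > 5)

prompt = tok("Once upon a time", return_tensors="pt").input_ids
gen = student.generate(prompt, max_new_tokens=20, do_sample=True)
completion = gen[:, prompt.shape[1]:]

# Log-probabilities of the sampled completion under the current policy.
logits = student(gen).logits[:, prompt.shape[1] - 1 : -1, :]
logprobs = torch.log_softmax(logits, -1).gather(-1, completion.unsqueeze(-1)).squeeze(-1)

loss = -teacher_reward(tok.decode(completion[0])) * logprobs.sum()
loss.backward()
optimizer.step()
```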
---
### Evaluation Results
| **Task** | **1000M-pre** | **SimPO** | **900M-pre** | **900M-RL** |
|:------------- | ------------: | ---------:| ------------:| -----------:|
| BLiMP | 74.88 | 72.16 | 77.52 | **77.53** |
| Suppl. | **63.32** | 61.22 | 56.62 | 56.72 |
| EWoK          | 51.67         | **51.92** | 51.36        | 51.41       |
| COMPS | **56.17** | 55.05 | 55.20 | 55.18 |
| ET | 31.51 | 28.06 | 30.34 | **33.11** |
| GLUE | 52.18 | 50.35 | **53.14** | 52.46 |
See the [BabyLM Challenge](https://huggingface.co/BabyLM-community) organization for the baselines.
# Technical Specifications
### Hardware
- 4 A100 GPUs were used to train this model.
### Software
PyTorch
### Training Time
The model took 2.5 hours to train and consumed 755 core hours (with 4 GPUs and 32 CPUs).
# Citation
```bibtex
@misc{MayerMartinsBKB2025,
    title={Once Upon a Time: Interactive Learning for Storytelling with Small Language Models},
    author={Jonas Mayer Martins and Ali Hamza Bashir and Muhammad Rehan Khalid and Lisa Beinborn},
    year={2025},
    eprint={2502.TODO},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={ToDo},
}
```
# Model Card Authors
Jonas Mayer Martins
# Bibliography
[GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7) (Wang et al., ICLR 2019)
[SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems](https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf) (Wang et al., NeurIPS 2019)
[BLiMP: The Benchmark of Linguistic Minimal Pairs for English](https://aclanthology.org/2020.tacl-1.25/) (Warstadt et al., TACL 2020)
[Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora](https://aclanthology.org/2023.conll-babylm.1/) (Warstadt et al., CoNLL-BabyLM 2023)
[🌏 Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in language models](https://arxiv.org/pdf/2405.09605v1) (Ivanova et al., 2024)
[Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data](https://link.springer.com/article/10.3758/s13428-023-02261-8) (de Varda et al., BRM 2024)
[Entity Tracking in Language Models](https://aclanthology.org/2023.acl-long.213/) (Kim & Schuster, ACL 2023)
[Derivational Morphology Reveals Analogical Generalization in Large Language Models](https://arxiv.org/pdf/2411.07990) (Hofmann et al., 2024)
[Automatically Constructing a Corpus of Sentential Paraphrases](https://aclanthology.org/I05-5002/) (Dolan & Brockett, IJCNLP 2005)
[A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference](https://aclanthology.org/N18-1101/) (Williams et al., NAACL 2018)
[The Winograd Schema Challenge](http://dl.acm.org/citation.cfm?id=3031843.3031909) (Levesque et al., PKRR 2012)
[The PASCAL Recognising Textual Entailment Challenge](https://link.springer.com/chapter/10.1007/11736790_9) (Dagan et al., Springer 2006)
The Second PASCAL Recognising Textual Entailment Challenge (Bar-Haim et al., 2006)
[The Third PASCAL Recognizing Textual Entailment Challenge](https://aclanthology.org/W07-1401/) (Giampiccolo et al., 2007)
[The Fifth PASCAL Recognizing Textual Entailment Challenge](https://tac.nist.gov/publications/2009/additional.papers/RTE5_overview.proceedings.pdf) (Bentivogli et al., TAC 2009)
[BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions](https://aclanthology.org/N19-1300/) (Clark et al., NAACL 2019)
[Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences](https://aclanthology.org/N18-1023/) (Khashabi et al., NAACL 2018)