|
--- |
|
language: |
|
- en |
|
tags: |
|
- babylm-baseline |
|
- strict |
|
- babylm-2025 |
|
--- |
|
|
|
# Model Card for the Preference Optimization Interaction Baseline |
|
|
|
<!-- Provide a quick summary of what the model is/does. [Optional] --> |
|
A 124M-parameter model with the GPT-2 architecture, trained with the next-token prediction loss for 10 epochs (~900M words) **on 90% of the BabyLM corpus**, as a naive autoregressive baseline for the Interaction track of the 2025 BabyLM Challenge.
|
|
|
This model card is based on the model card of the BabyLM [100M GPT-2 baseline](https://huggingface.co/BabyLM-community/babylm-baseline-100m-gpt2).
|
|
|
# Table of Contents |
|
|
|
- [Model Card for the Preference Optimization Interaction Baseline](#model-card-for-the-preference-optimization-interaction-baseline)
|
- [Table of Contents](#table-of-contents) |
|
- [Model Details](#model-details) |
|
- [Model Description](#model-description) |
|
- [Uses](#uses) |
|
- [Training Details](#training-details) |
|
- [Training Data](#training-data) |
|
- [Hyperparameters](#hyperparameters) |
|
- [Training Procedure](#training-procedure) |
|
- [Size and Checkpoints](#size-and-checkpoints) |
|
- [Evaluation](#evaluation) |
|
- [Testing Data & Metrics](#testing-data--metrics)
|
- [Testing Data](#testing-data) |
|
- [Metrics](#metrics) |
|
- [Results](#results) |
|
- [Technical Specifications](#technical-specifications)
|
|
- [Hardware](#hardware) |
|
- [Software](#software) |
|
- [Training Time](#training-time) |
|
- [Citation](#citation) |
|
- [Model Card Authors](#model-card-authors)
|
- [Bibliography](#bibliography) |
|
|
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
<!-- Provide a longer summary of what this model is/does. --> |
|
This is the pretrained GPT-2 model that serves as the basis for PPO fine-tuning in the Interaction track of the 2025 BabyLM Challenge.
|
|
|
- **Developed by:** Jonas Mayer Martins, Ali Hamza Bashir, Muhammad Rehan Khalid |
|
- **Model type:** Causal language model |
|
- **Language(s) (NLP):** eng |
|
- **Resources for more information:** |
|
- [GitHub Repo](https://github.com/malihamza/babylm-interactive-learning) |
|
|
|
|
|
# Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
This is a pretrained language model.

It can be used for zero-shot evaluation of tasks and can also be fine-tuned for downstream tasks.

It can be used for language generation, but given its small size and the limited number of words it was trained on, do not expect LLM-level performance.
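
As a minimal usage sketch (assuming the standard `transformers` interface; the repository ID below is a placeholder for this model's actual Hub ID):

```python
# Minimal generation sketch; the repo ID is a placeholder, not the real one.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BabyLM-community/babylm-interaction-baseline"  # placeholder ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```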
|
|
|
# Training Details |
|
|
|
## Training Data |
|
|
|
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
We used the BabyLM 100M (Strict) dataset for training. **We trained the tokenizer and model on a randomly selected 90% of the corpus**, which is composed of the following:
|
|
|
| Source | Weight | Domain | Citation | Website | License | |
|
| --- | --- | --- | --- | --- | --- | |
|
| BNC | 8% | Dialogue | BNC Consortium (2007) | [link](http://www.natcorp.ox.ac.uk/) | [link](http://www.natcorp.ox.ac.uk/docs/licence.html) <sup>1</sup> | |
|
| CHILDES | 29% | Dialogue, Child-Directed | MacWhinney (2000) | | [link](https://talkbank.org/share/rules.html) | |
|
| Project Gutenberg | 26% | Fiction, Nonfiction | Gerlach & Font-Clos (2020) | [link](https://github.com/pgcorpus/gutenberg) | [link](https://www.gutenberg.org/policy/license.html) | |
|
| OpenSubtitles | 20% | Dialogue, Scripted | Lison & Tiedemann (2016) | [link](https://opus.nlpl.eu/OpenSubtitles-v2018.php) | Open source |
|
| Simple English Wikipedia | 15% | Nonfiction | -- | [link](https://dumps.wikimedia.org/simplewiki/20221201/) | [link](https://dumps.wikimedia.org/legal.html) | |
|
| Switchboard | 1% | Dialogue | Godfrey et al. (1992), Stolcke et al., (2000) | [link](http://compprag.christopherpotts.net/swda.html) | [link](http://compprag.christopherpotts.net/swda.html) | |
|
|
|
<sup>1</sup> Our distribution of part of the BNC Texts is permitted under the fair dealings provision of copyright law (see term (2g) in the BNC license). |
|
|
|
## Hyperparameters |
|
|
|
| Hyperparameter | Value | |
|
| --- | --- | |
|
| Number of epochs | 10 | |
|
| Datapoint length | 512 | |
|
| Batch size | 16 | |
|
| Gradient accumulation steps | 4 | |
|
| Learning rate | 0.0005 | |
|
| Number of steps | 211650 | |
|
| Warmup steps | 2116 | |
|
| Gradient clipping | 1 | |
|
| Optimizer | AdamW | |
|
| Optimizer Beta_1 | 0.9 | |
|
| Optimizer Beta_2 | 0.999 | |
|
| Optimizer Epsilon | 10<sup>-8</sup>| |
|
| Tokenizer | Byte-level BPE |
|
| Vocab Size | 16000 | |
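
For concreteness, a sketch of how these settings map onto a standard PyTorch setup. The warmup schedule type is an assumption (the table only specifies the number of warmup steps); with a batch size of 16 and 4 gradient-accumulation steps, the effective batch size is 64 sequences of 512 tokens.

```python
# Sketch of the optimizer/scheduler configuration from the table above.
# The linear warmup schedule is an assumption; the table does not name one.
import torch
from transformers import GPT2Config, GPT2LMHeadModel, get_linear_schedule_with_warmup

model = GPT2LMHeadModel(GPT2Config(vocab_size=16000))  # GPT-2-small, 16k vocab

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,             # learning rate
    betas=(0.9, 0.999),  # Beta_1, Beta_2
    eps=1e-8,            # epsilon
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2116, num_training_steps=211650
)
# Gradient clipping at 1.0 is applied before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```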
|
|
|
## Training Procedure |
|
|
|
The model is trained with the next-token prediction loss for 10 epochs.
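
Concretely, a hedged sketch of one training step under this objective: with `labels=input_ids`, the Hugging Face model shifts the labels internally, so the token at position *t* is trained to predict the token at position *t+1*.

```python
# One next-token prediction step on a dummy batch (shapes from the table above).
import torch
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(vocab_size=16000))
input_ids = torch.randint(0, 16000, (16, 512))  # batch of 512-token sequences

# Cross-entropy over the vocabulary at every position, labels shifted by one.
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
```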
|
|
|
### Size and Checkpoints
|
|
|
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
|
|
|
The model has 124M parameters. |
|
In total, we train on around 1B words and provide multiple checkpoints from training (see the loading sketch after the list).

Specifically, we provide:

- Checkpoints every 1M words for the first 10M words

- Checkpoints every 10M words up to 100M words

- Checkpoints every 100M words up to 1B words
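
Assuming the checkpoints are published as Hub revisions (the revision name below is hypothetical; check the repository's branches and tags for the actual naming):

```python
# Loading an intermediate checkpoint by revision; both the repo ID and the
# revision name are placeholders, not verified names.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "BabyLM-community/babylm-interaction-baseline",  # placeholder repo ID
    revision="checkpoint-100M-words",                # hypothetical revision name
)
```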
|
|
|
# Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
This model is evaluated in two ways: |
|
1. We do zero-shot evaluation on 7 tasks. |
|
2. We do fine-tuning on a subset of the (Super)GLUE tasks (Wang et al., ICLR 2019; Wang et al., NeurIPS 2019).
|
|
|
## Testing Data & Metrics |
|
|
|
### Testing Data |
|
|
|
<!-- This should link to a Data Card if possible. --> |
|
|
|
For the BLiMP, BLiMP Supplement, and EWoK tasks, we use a filtered version of each dataset that only includes examples whose words are found in the BabyLM dataset.

For the fine-tuning tasks, we both filter and subsample the data, down to a maximum of 10,000 training examples (a sketch of this preparation step follows).
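
A minimal sketch of this preparation step, assuming `vocab` is the set of word types attested in the BabyLM training data (the function and field names here are illustrative):

```python
# Keep only examples whose words all appear in the training vocabulary,
# then subsample to at most 10,000 examples. Field names are illustrative.
import random

def prepare(examples: list[dict], vocab: set[str], max_n: int = 10_000) -> list[dict]:
    kept = [ex for ex in examples
            if all(w.lower() in vocab for w in ex["text"].split())]
    random.seed(0)  # fixed seed for a reproducible subsample
    return random.sample(kept, min(max_n, len(kept)))
```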
|
|
|
*Validation Data* |
|
|
|
*Zero-shot Tasks* |
|
|
|
- **BLiMP**: The Benchmark of Linguistic Minimal Pairs evaluates the model's linguistic ability by testing whether it can recognize the grammatically correct sentence in a pair of minimally different sentences (scored as sketched after this list). It covers a range of grammatical phenomena. (Warstadt et al., TACL 2020)

- **BLiMP Supplement**: A supplement to BLiMP introduced in the first edition of the BabyLM Challenge, focused more on dialogue and questions. (Warstadt et al., CoNLL-BabyLM 2023)

- **EWoK**: Works similarly to BLiMP but probes the model's internal world knowledge, testing whether the model has both physical and social knowledge. (Ivanova et al., 2024)

- **Eye Tracking and Self-paced Reading**: Tests whether the model can mimic human eye-tracking and reading-time data, using the surprisal of a word as a proxy for the time spent reading it. (de Varda et al., BRM 2024)

- **Entity Tracking**: Checks whether a model can keep track of changes to the states of entities as text/dialogue unfolds. (Kim & Schuster, ACL 2023)

- **WUGs**: Tests morphological generalization in LMs through an adjective nominalization task and a past-tense task. (Hofmann et al., 2024; Weissweiler et al., 2023)

- **COMPS**: Tests property knowledge of concepts through minimal pairs. (Misra et al., 2023)
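
The minimal-pair tasks above are typically scored by checking whether the model assigns a higher total log-probability to the acceptable sentence. A hedged sketch (the official evaluation pipeline may differ in details such as tokenization and normalization; the repo ID is a placeholder):

```python
# Score each sentence of a minimal pair by its total log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BabyLM-community/babylm-interaction-baseline"  # placeholder ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # .loss is the mean cross-entropy over the ids.shape[1] - 1 predicted
        # tokens; multiply back to get the total log-probability.
        loss = model(input_ids=ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

good = "The cats sleep on the sofa."
bad = "The cats sleeps on the sofa."
print(sentence_logprob(good) > sentence_logprob(bad))  # True if the model prefers `good`
```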
|
|
|
*Finetuning Tasks* |
|
|
|
- **BoolQ**: A yes/no QA dataset with unprompted and unconstrained questions. (Clark et al., NAACL 2019) |
|
- **MNLI**: The Multi-Genre Natural Language Inference corpus tests the language understanding of a model by seeing whether it can recognize textual entailment. (Williams et al., NAACL 2018)

- **MRPC**: The Microsoft Research Paraphrase Corpus contains pairs of sentences that are either paraphrases/semantically equivalent to each other or unrelated. (Dolan & Brockett, IJCNLP 2005)

- **QQP**<sup>2</sup>: Similarly to MRPC, the Quora Question Pairs corpus tests the model's ability to determine whether a pair of questions is semantically similar. The questions are sourced from Quora.

- **MultiRC**: The Multi-Sentence Reading Comprehension corpus is a QA task that evaluates the model's ability to pick the correct answer from a list of answers, given a question and a context paragraph. In this version, the data is recast as binary classification: judging whether a candidate answer to a question-context pair is correct. (Khashabi et al., NAACL 2018)

- **RTE**: The Recognizing Textual Entailment corpus likewise tests the model's ability to recognize textual entailment. (Dagan et al., Springer 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., TAC 2009)

- **WSC**: The Winograd Schema Challenge tests the model's ability to do coreference resolution on sentences with a pronoun and a list of noun phrases found in the sentence. This version recasts it as binary classification over examples consisting of a pronoun and a noun phrase. (Levesque et al., PKRR 2012)
|
|
|
<sup>2</sup> https://www.quora.com/profile/Ricky-Riche-2/First-Quora-Dataset-Release-Question-Pairs |
|
|
|
### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
The metrics used to evaluate the model are the following: |
|
- Zero-shot:

  - Accuracy on predicting the correct completion/sentence for BLiMP, BLiMP Supplement, EWoK, Entity Tracking, and WUGs

  - Change in R^2 prediction from baseline for Eye Tracking (with no spillover) and Self-paced Reading (1-word spillover)

- Fine-tuning:

  - 3-class accuracy for MNLI

  - Binary accuracy for BoolQ, MultiRC, and WSC

  - F1 score for MRPC and QQP
|
|
|
The metrics were chosen following the recommendations of the papers that introduced each task. A sketch of the surprisal predictor used for the reading-time tasks follows.
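
For the reading-time tasks, the per-word predictor is surprisal, i.e. the negative log-probability of a token given its context. A simplified sketch (a word's surprisal is the sum over its sub-tokens, which this version leaves out; `model` and `tokenizer` as in the sketches above):

```python
# Per-token surprisal in bits for a text.
import math
import torch

def token_surprisals(model, tokenizer, text: str) -> list[float]:
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts positions 1..n-1
    targets = ids[0, 1:]
    nll = -logprobs[torch.arange(targets.shape[0]), targets]
    return (nll / math.log(2)).tolist()  # nats -> bits
```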
|
|
|
### Fine-tuning Hyperparameters
|
|
|
|
|
| Hyperparameter | MNLI, RTE, QQP, MRPC, BoolQ, MultiRC | WSC | |
|
| --- | --- | --- | |
|
| Learning Rate | 3\*10<sup>-5</sup> | 3\*10<sup>-5</sup> | |
|
| Batch Size | 16 | 16 | |
|
| Epochs | 10 | 30 | |
|
| Weight decay | 0.01 | 0.01 | |
|
| Optimizer | AdamW | AdamW | |
|
| Scheduler | cosine | cosine | |
|
| Warmup percentage | 6% | 6% | |
|
| Dropout | 0.1 | 0.1 | |
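
A sketch of how this configuration maps onto the Hugging Face `Trainer` (task heads, datasets, and metric functions omitted; the output path is a placeholder). Dropout of 0.1 is a property of the model config, and AdamW is the `Trainer` default optimizer.

```python
# Fine-tuning configuration from the table above (MNLI-style tasks).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finetune-mnli",   # placeholder output path
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,          # 30 for WSC
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,            # 6% warmup
)
```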
|
|
|
## Results |
|
|
|
We compare our student model against two official baselines from the 2025 BabyLM Challenge:
|
|
|
- **1000M-pre:** The standard *pretraining* baseline, using a GPT-2-small model trained on 100M unique words from the BabyLM dataset (10 epochs, next-word prediction). |
|
- **SimPO:** A baseline first trained for 7 epochs with next-word prediction, then 2 epochs *interleaving* prediction and reinforcement learning. Here, the RL reward encourages the student to generate completions similar to the teacher's output (a rough sketch of such a reward follows this list).
|
- **900M-pre:** Our model, using the same GPT-2-small architecture, pretrained on 90% of the BabyLM dataset (yielding approximately 91M unique words, 10 epochs). |
|
- **900M-RL:** Our model after additional PPO-based reinforcement learning with the teacher, using about 1M words as input for the interactive (RL) phase. |
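
To make the similarity-based reward concrete, one plausible instantiation scores a student completion by its average log-likelihood under the frozen teacher. This is an illustration only, not necessarily the reward the baselines use:

```python
# Illustrative reward: average teacher log-probability of the student's
# completion tokens. The actual reward used by the baselines may differ.
import torch

def teacher_reward(teacher, tok, prompt: str, completion: str) -> float:
    full = tok(prompt + completion, return_tensors="pt")["input_ids"]
    n_prompt = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = teacher(input_ids=full).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full[0, 1:]
    tok_lp = logprobs[torch.arange(targets.shape[0]), targets]
    return tok_lp[n_prompt - 1:].mean().item()  # completion tokens only
```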
|
|
|
--- |
|
|
|
### Evaluation Results |
|
|
|
| **Task** | **1000M-pre** | **SimPO** | **900M-pre** | **900M-RL** | |
|
|:------------- | ------------: | ---------:| ------------:| -----------:| |
|
| BLiMP | 74.88 | 72.16 | 77.52 | **77.53** | |
|
| Suppl. | **63.32** | 61.22 | 56.62 | 56.72 | |
|
| EWoK | 51.67 | **51.92** | 51.36 | 51.41 |
|
| COMPS | **56.17** | 55.05 | 55.20 | 55.18 | |
|
| ET | 31.51 | 28.06 | 30.34 | **33.11** | |
|
| GLUE | 52.18 | 50.35 | **53.14** | 52.46 | |
|
|
|
|
See the [BabyLM community page](https://huggingface.co/BabyLM-community) for the official baselines.
|
|
|
# Technical Specifications |
|
|
|
## Hardware
|
|
|
- 4 A100 GPUs were used to train this model. |
|
|
|
## Software
|
|
|
PyTorch |
|
|
|
## Training Time
|
|
|
The model took 2.5 hours to train and consumed 755 core-hours (with 4 GPUs and 32 CPUs).
|
|
|
# Citation |
|
|
|
```bibtex
|
@misc{MayerMartinsBKB2025, |
|
title={Once Upon a Time: Interactive Learning for Storytelling with Small Language Models}, |
|
author={Jonas Mayer Martins and Ali Hamza Bashir and Muhammad Rehan Khalid and Lisa Beinborn},
|
year={2025}, |
|
eprint={2502.TODO}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={ToDo}, |
|
} |
|
``` |
|
|
|
# Model Card Authors |
|
|
|
Jonas Mayer Martins |
|
|
|
# Bibliography |
|
|
|
[GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7) (Wang et al., ICLR 2019) |
|
|
|
[SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems](https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf) (Wang et al., NeurIPS 2019) |
|
|
|
[BLiMP: The Benchmark of Linguistic Minimal Pairs for English](https://aclanthology.org/2020.tacl-1.25/) (Warstadt et al., TACL 2020) |
|
|
|
[Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora](https://aclanthology.org/2023.conll-babylm.1/) (Warstadt et al., CoNLL-BabyLM 2023) |
|
|
|
[🌏 Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in language models](https://arxiv.org/pdf/2405.09605v1) (Ivanova et al., 2024) |
|
|
|
[Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data](https://link.springer.com/article/10.3758/s13428-023-02261-8) (de Varda et al., BRM 2024) |
|
|
|
[Entity Tracking in Language Models](https://aclanthology.org/2023.acl-long.213/) (Kim & Schuster, ACL 2023) |
|
|
|
[Derivational Morphology Reveals Analogical Generalization in Large Language Models](https://arxiv.org/pdf/2411.07990) (Hofmann et al., 2024) |
|
|
|
[Automatically Constructing a Corpus of Sentential Paraphrases](https://aclanthology.org/I05-5002/) (Dolan & Brockett, IJCNLP 2005) |
|
|
|
[A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference](https://aclanthology.org/N18-1101/) (Williams et al., NAACL 2018) |
|
|
|
[The Winograd Schema Challenge]( http://dl.acm.org/citation.cfm?id=3031843.3031909) (Levesque et al., PKRR 2012) |
|
|
|
[The PASCAL Recognising Textual Entailment Challenge](https://link.springer.com/chapter/10.1007/11736790_9) (Dagan et al., Springer 2006) |
|
|
|
The Second PASCAL Recognising Textual Entailment Challenge (Bar-Haim et al., 2006)
|
|
|
[The Third PASCAL Recognizing Textual Entailment Challenge](https://aclanthology.org/W07-1401/) (Giampiccolo et al., 2007) |
|
|
|
[The Fifth PASCAL Recognizing Textual Entailment Challenge](https://tac.nist.gov/publications/2009/additional.papers/RTE5_overview.proceedings.pdf) (Bentivogli et al., TAC 2009) |
|
|
|
[BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions](https://aclanthology.org/N19-1300/) (Clark et al., NAACL 2019) |
|
|
|
[Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences](https://aclanthology.org/N18-1023/) (Khashabi et al., NAACL 2018) |