---
language:
- en
tags:
- babylm-baseline
- strict
- babylm-2025
---

# Model Card for the Preference Optimization Interaction Baseline

A 124M-parameter model with the GPT-2 architecture, trained with the next-token prediction loss for 10 epochs (~900M words) **on 90% of the BabyLM corpus**, as a naive autoregressive baseline for the Interaction track of the 2025 BabyLM Challenge.

This model card is based on the model card of the BabyLM [100M GPT-2 baseline](https://huggingface.co/BabyLM-community/babylm-baseline-100m-gpt2).

# Table of Contents

- [Model Card for the Preference Optimization Interaction Baseline](#model-card-for-the-preference-optimization-interaction-baseline)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Hyperparameters](#hyperparameters)
  - [Training Procedure](#training-procedure)
    - [Size and Checkpoints](#size-and-checkpoints)
- [Evaluation](#evaluation)
  - [Testing Data & Metrics](#testing-data--metrics)
    - [Testing Data](#testing-data)
    - [Metrics](#metrics)
  - [Results](#results)
- [Technical Specifications](#technical-specifications)
  - [Hardware](#hardware)
  - [Software](#software)
  - [Training Time](#training-time)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors)
- [Bibliography](#bibliography)

# Model Details

## Model Description

This is the pretrained GPT-2 model that serves as the basis for PPO fine-tuning in the Interaction track of the 2025 BabyLM Challenge.

- **Developed by:** Jonas Mayer Martins, Ali Hamza Bashir, Muhammad Rehan Khalid
- **Model type:** Causal language model
- **Language(s) (NLP):** eng
- **Resources for more information:**
  - [GitHub Repo](https://github.com/malihamza/babylm-interactive-learning)

# Uses

This is a pretrained language model. It can be used to evaluate tasks in a zero-shot manner and can also be fine-tuned for downstream tasks. It can be used for language generation, but given its small size and the limited number of words it was trained on, do not expect LLM-level performance.
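As a quick-start illustration, the snippet below loads the model with the Hugging Face `transformers` library and samples a short continuation. This is a minimal sketch, not part of the official pipeline: the repository id is a placeholder, so substitute this model's actual Hub id or a local path to a downloaded checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- replace with this model's actual Hub id or a local path.
model_id = "BabyLM-community/babylm-interaction-baseline-gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a short continuation; expect simple text rather than LLM-level output.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```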
# Training Details

## Training Data

We used the BabyLM 100M (Strict) dataset for training. **We trained the tokenizer and the model on a randomly selected 90% of the corpus**, which is composed of the following:

| Source | Weight | Domain | Citation | Website | License |
| --- | --- | --- | --- | --- | --- |
| BNC | 8% | Dialogue | BNC Consortium (2007) | [link](http://www.natcorp.ox.ac.uk/) | [link](http://www.natcorp.ox.ac.uk/docs/licence.html) <sup>1</sup> |
| CHILDES | 29% | Dialogue, Child-Directed | MacWhinney (2000) | | [link](https://talkbank.org/share/rules.html) |
| Project Gutenberg | 26% | Fiction, Nonfiction | Gerlach & Font-Clos (2020) | [link](https://github.com/pgcorpus/gutenberg) | [link](https://www.gutenberg.org/policy/license.html) |
| OpenSubtitles | 20% | Dialogue, Scripted | Lison & Tiedemann (2016) | [link](https://opus.nlpl.eu/OpenSubtitles-v2018.php) | Open source |
| Simple English Wikipedia | 15% | Nonfiction | -- | [link](https://dumps.wikimedia.org/simplewiki/20221201/) | [link](https://dumps.wikimedia.org/legal.html) |
| Switchboard | 1% | Dialogue | Godfrey et al. (1992), Stolcke et al. (2000) | [link](http://compprag.christopherpotts.net/swda.html) | [link](http://compprag.christopherpotts.net/swda.html) |

<sup>1</sup> Our distribution of part of the BNC texts is permitted under the fair dealings provision of copyright law (see term (2g) in the BNC license).

## Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Number of epochs | 10 |
| Datapoint length | 512 |
| Batch size | 16 |
| Gradient accumulation steps | 4 |
| Learning rate | 0.0005 |
| Number of steps | 211650 |
| Warmup steps | 2116 |
| Gradient clipping | 1 |
| Optimizer | AdamW |
| Optimizer Beta_1 | 0.9 |
| Optimizer Beta_2 | 0.999 |
| Optimizer Epsilon | 10<sup>-8</sup> |
| Tokenizer | Byte-pair encoding (BPE) |
| Vocab size | 16000 |

## Training Procedure

The model is trained with the next-token prediction loss for 10 epochs.

### Size and checkpoints

The model has 124M parameters. In total, we train on around 1B words and provide multiple checkpoints from training. Specifically, we provide:

- Checkpoints every 1M words for the first 10M words
- Checkpoints every 10M words up to 100M words
- Checkpoints every 100M words up to 1B words

# Evaluation

This model is evaluated in two ways:

1. We do zero-shot evaluation on 7 tasks.
2. We fine-tune on a subset of the (Super)GLUE tasks (Wang et al., ICLR 2019; Wang et al., NeurIPS 2019).

## Testing Data & Metrics

### Testing Data

For the BLiMP, BLiMP Supplement, and EWoK tasks, we use a filtered version of each dataset that only includes examples whose words occur in the BabyLM dataset. For the fine-tuning tasks, we both filter and subsample down to a maximum of 10,000 training examples.

*Validation Data*

*Zero-shot Tasks*

- **BLiMP**: The Benchmark of Linguistic Minimal Pairs evaluates the model's linguistic ability by testing whether it can recognize the grammatically correct sentence in a pair of minimally different sentences. It covers a range of grammatical phenomena; a scoring sketch is given after this list. (Warstadt et al., TACL 2020)
- **BLiMP Supplement**: A supplement to BLiMP introduced in the first edition of the BabyLM Challenge, focused more on dialogue and questions. (Warstadt et al., CoNLL-BabyLM 2023)
- **EWoK**: Works similarly to BLiMP but probes the model's internal world knowledge, covering both physical and social knowledge. (Ivanova et al., 2024)
- **Eye Tracking and Self-paced Reading**: Tests whether the model can mimic human eye-tracking and reading-time measures, using the surprisal of a word as a proxy for the time spent reading it. (de Varda et al., BRM 2024)
- **Entity Tracking**: Checks whether the model can keep track of changes to the states of entities as text/dialogue unfolds. (Kim & Schuster, ACL 2023)
- **WUGs**: Tests morphological generalization in LMs through an adjective nominalization task and a past-tense task. (Hofmann et al., 2024; Weissweiler et al., 2023)
- **COMPS**: Tests property knowledge. (Misra et al., 2023)
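To make the zero-shot protocol concrete, here is a minimal sketch of minimal-pair scoring in the style of BLiMP: each sentence in a pair is scored by its summed token log-probability under the model, and the prediction counts as correct if the grammatical sentence gets the higher score. This is an illustrative sketch only, not the official evaluation pipeline; the repository id and the example pair are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BabyLM-community/babylm-interaction-baseline-gpt2"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()


def sentence_logprob(sentence: str) -> float:
    """Summed log-probability of a sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given its left context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_scores = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_scores.sum().item()


# Toy minimal pair; the real benchmark provides thousands of such pairs.
good = "The cats sleep on the sofa."
bad = "The cats sleeps on the sofa."
print("correct" if sentence_logprob(good) > sentence_logprob(bad) else "incorrect")
```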
*Finetuning Tasks*

- **BoolQ**: A yes/no QA dataset with unprompted and unconstrained questions. (Clark et al., NAACL 2019)
- **MNLI**: The Multi-Genre Natural Language Inference corpus tests the language understanding of a model by seeing whether it can recognize textual entailment. (Williams et al., NAACL 2018)
- **MRPC**: The Microsoft Research Paraphrase Corpus contains pairs of sentences that are either paraphrases/semantically equivalent to each other or unrelated. (Dolan & Brockett, IJCNLP 2005)
- **QQP**<sup>2</sup>: Similarly to MRPC, the Quora Question Pairs corpus tests the model's ability to determine whether a pair of questions is semantically similar. The questions are sourced from Quora.
- **MultiRC**: The Multi-Sentence Reading Comprehension corpus is a QA task that evaluates the model's ability to pick the correct answer from a list of answers, given a question and a context paragraph. In this version, the data is recast as binary classification: judging whether a candidate answer to a question and its context is correct. (Khashabi et al., NAACL 2018)
- **RTE**: The Recognizing Textual Entailment corpus likewise tests the model's ability to recognize textual entailment. (Dagan et al., Springer 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., TAC 2009)
- **WSC**: The Winograd Schema Challenge tests the model's ability to do coreference resolution on sentences with a pronoun and a list of noun phrases found in the sentence. This version recasts it as binary classification over examples consisting of a pronoun and a noun phrase. (Levesque et al., PKRR 2012)

<sup>2</sup> https://www.quora.com/profile/Ricky-Riche-2/First-Quora-Dataset-Release-Question-Pairs

### Metrics

The metrics used to evaluate the model are the following:

- Zero-shot
  - Accuracy on predicting the correct completion/sentence for BLiMP, BLiMP Supplement, EWoK, Entity Tracking, and WUGs
  - Change in R² prediction from baseline for Eye Tracking (with no spillover) and Self-paced Reading (1-word spillover)
- Finetuning
  - 3-class accuracy for MNLI
  - Binary accuracy for BoolQ, MultiRC, and WSC
  - F1-score for MRPC and QQP

The metrics were chosen based on the recommendations of the papers the tasks come from.

### Hyperparameters

The fine-tuning hyperparameters are listed below; an illustrative fine-tuning sketch follows the table.

| Hyperparameter | MNLI, RTE, QQP, MRPC, BoolQ, MultiRC | WSC |
| --- | --- | --- |
| Learning rate | 3 × 10<sup>-5</sup> | 3 × 10<sup>-5</sup> |
| Batch size | 16 | 16 |
| Epochs | 10 | 30 |
| Weight decay | 0.01 | 0.01 |
| Optimizer | AdamW | AdamW |
| Scheduler | cosine | cosine |
| Warmup percentage | 6% | 6% |
| Dropout | 0.1 | 0.1 |
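As a rough illustration of how these settings translate into a fine-tuning run, here is a minimal sketch using the Hugging Face `Trainer` on MRPC. It is not the official evaluation pipeline (which filters and subsamples the data, as described above); the repository id is a placeholder, and the unmodified GLUE data is loaded here only for illustration. Dropout is left at the GPT-2 default of 0.1.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "BabyLM-community/babylm-interaction-baseline-gpt2"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:  # GPT-2-style tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Example task: MRPC (sentence-pair paraphrase classification).
dataset = load_dataset("glue", "mrpc")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True),
    batched=True,
)

args = TrainingArguments(
    output_dir="finetune-mrpc",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,          # 30 for WSC, per the table above
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```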
## Results

We compare our models against two official baselines from the 2025 BabyLM Challenge:

- **1000M-pre:** The standard *pretraining* baseline, using a GPT-2-small model trained on 100M unique words from the BabyLM dataset (10 epochs, next-word prediction).
- **SimPO:** A baseline first trained for 7 epochs with next-word prediction, then 2 epochs *interleaving* prediction and reinforcement learning. Here, the RL reward encourages the student to generate completions similar to the teacher's output.
- **900M-pre:** Our model, using the same GPT-2-small architecture, pretrained on 90% of the BabyLM dataset (yielding approximately 91M unique words, 10 epochs).
- **900M-RL:** Our model after additional PPO-based reinforcement learning with the teacher, using about 1M words as input for the interactive (RL) phase.

### Evaluation Results

| **Task** | **1000M-pre** | **SimPO** | **900M-pre** | **900M-RL** |
|:------------- | ------------: | ---------:| ------------:| -----------:|
| BLiMP | 74.88 | 72.16 | 77.52 | **77.53** |
| Suppl. | **63.32** | 61.22 | 56.62 | 56.72 |
| EWoK | 51.67 | **51.92** | 51.36 | 51.41 |
| COMPS | **56.17** | 55.05 | 55.20 | 55.18 |
| ET | 31.51 | 28.06 | 30.34 | **33.11** |
| GLUE | 52.18 | 50.35 | **53.14** | 52.46 |

#### Model descriptions

- **1000M-pre:** Baseline pretrained on 100M words (BabyLM Challenge baseline).
- **SimPO:** Baseline using a hybrid of pretraining and RL with a similarity-based reward.
- **900M-pre:** Our GPT-2-small model, pretrained on 90M words (similar settings to the baseline, but less data).
- **900M-RL:** The same model as 900M-pre, further trained with PPO using teacher feedback on 1M words of input.
- See the [BabyLM Challenge](https://huggingface.co/BabyLM-community) organization for the baselines.

# Technical Specifications

### Hardware

- 4 A100 GPUs were used to train this model.

### Software

PyTorch

### Training Time

The model took 2.5 hours to train and consumed 755 core hours (with 4 GPUs and 32 CPUs).

# Citation

```bibtex
@misc{MayerMartinsBKB2025,
  title={Once Upon a Time: Interactive Learning for Storytelling with Small Language Models},
  author={Jonas Mayer Martins and Ali Hamza Bashir and Muhammad Rehan Khalid and Lisa Beinborn},
  year={2025},
  eprint={2502.TODO},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={ToDo},
}
```

# Model Card Authors

Jonas Mayer Martins

# Bibliography

[GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://openreview.net/pdf?id=rJ4km2R5t7) (Wang et al., ICLR 2019)

[SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems](https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf) (Wang et al., NeurIPS 2019)

[BLiMP: The Benchmark of Linguistic Minimal Pairs for English](https://aclanthology.org/2020.tacl-1.25/) (Warstadt et al., TACL 2020)

[Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora](https://aclanthology.org/2023.conll-babylm.1/) (Warstadt et al., CoNLL-BabyLM 2023)

[🌏 Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in language models](https://arxiv.org/pdf/2405.09605v1) (Ivanova et al., 2024)

[Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data](https://link.springer.com/article/10.3758/s13428-023-02261-8) (de Varda et al., BRM 2024)

[Entity Tracking in Language Models](https://aclanthology.org/2023.acl-long.213/) (Kim & Schuster, ACL 2023)

[Derivational Morphology Reveals Analogical Generalization in Large Language Models](https://arxiv.org/pdf/2411.07990) (Hofmann et al., 2024)

[Automatically Constructing a Corpus of Sentential Paraphrases](https://aclanthology.org/I05-5002/) (Dolan & Brockett, IJCNLP 2005)

[A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference](https://aclanthology.org/N18-1101/) (Williams et al., NAACL 2018)

[The Winograd Schema Challenge](http://dl.acm.org/citation.cfm?id=3031843.3031909) (Levesque et al., PKRR 2012)

[The PASCAL Recognising Textual Entailment Challenge](https://link.springer.com/chapter/10.1007/11736790_9) (Dagan et al., Springer 2006)

The Second PASCAL Recognising Textual Entailment Challenge (Bar-Haim et al., 2006)

[The Third PASCAL Recognizing Textual Entailment Challenge](https://aclanthology.org/W07-1401/) (Giampiccolo et al., 2007)
[The Fifth PASCAL Recognizing Textual Entailment Challenge](https://tac.nist.gov/publications/2009/additional.papers/RTE5_overview.proceedings.pdf) (Bentivogli et al., TAC 2009)

[BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions](https://aclanthology.org/N19-1300/) (Clark et al., NAACL 2019)

[Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences](https://aclanthology.org/N18-1023/) (Khashabi et al., NAACL 2018)