AIME

Paper

The American Invitational Mathematics Examination (AIME) is a selective and prestigious 15-question 3-hour test given to high school students who qualify based on their AMC 10 or AMC 12 scores. All problems have integer answers between 0 and 999 inclusive. Questions increase in difficulty as the exam progresses.

The AIME dataset evaluates mathematical problem-solving capabilities on competition-level mathematics problems.

Homepage: https://huggingface.co/datasets/simplescaling/aime_nofigures

Dataset

This implementation includes both:

  • aime_nofigures: AIME problems without figures/diagrams
  • aime_figures: AIME problems with figures/diagrams

The dataset uses problems from AIME competitions, formatted for language model evaluation.
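For quick inspection, the dataset can be loaded with the Hugging Face `datasets` library. The snippet below is a generic sketch: the split and column names are whatever the hub repo defines, so check them there rather than relying on the assumptions in the comments.

```python
# Minimal sketch: loading the dataset above with Hugging Face `datasets`.
from datasets import load_dataset

ds = load_dataset("simplescaling/aime_nofigures")
print(ds)              # lists the available splits
print(ds["train"][0])  # inspect one record (a "train" split is an assumption)
```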

Groups and Tasks

Groups

  • math_word_problems

Tasks

  • aime_nofigures: AIME problems without figures
  • aime_figures: AIME problems with figures
  • aime24_nofigures: AIME 2024 problems without figures
  • aime24_figures: AIME 2024 problems with figures
  • aime25_nofigures: AIME 2025 problems without figures
  • Aggregated variants (agg8, agg64) that sample each problem multiple times for coverage and majority-voting metrics (a usage sketch for running these tasks follows this list)
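As a rough usage illustration, any of the task names above can be passed to the harness's Python entry point. This sketch assumes a recent lm-evaluation-harness API (`simple_evaluate` with the `hf` backend); the model name is a placeholder, and the interface pinned by this repo may differ.

```python
# Hypothetical sketch: running one AIME task through lm-evaluation-harness.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    tasks=["aime24_nofigures"],  # any task name from the list above
)
print(results["results"]["aime24_nofigures"])
```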

Evaluation

The evaluation checks whether the model's output matches the correct integer answer (0-999). The implementation includes the following (a sketch of the extraction and voting logic follows this list):

  • Answer extraction from model outputs
  • Support for boxed answers (e.g., \boxed{123})
  • Optional GPT-4o-mini based answer extraction for complex formats
  • Coverage and majority voting metrics for aggregated tasks
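Below is a minimal sketch of what the regex path of this pipeline might look like (the GPT-4o-mini extraction path is omitted). The function names are illustrative and do not mirror the repo's actual utilities.

```python
import re
from collections import Counter

def extract_boxed_answer(output: str) -> str | None:
    """Take the last \\boxed{...} in the output (simple, non-nested regex)."""
    matches = re.findall(r"\\boxed\{([^{}]+)\}", output)
    return matches[-1].strip() if matches else None

def to_int(answer: str | None) -> int | None:
    """Parse an extracted answer as an AIME integer in [0, 999]."""
    try:
        value = int(answer)
    except (TypeError, ValueError):
        return None
    return value if 0 <= value <= 999 else None

def coverage(outputs: list[str], target: int) -> bool:
    """Coverage (pass@k): any of the k sampled outputs is correct."""
    return any(to_int(extract_boxed_answer(o)) == target for o in outputs)

def majority_vote(outputs: list[str], target: int) -> bool:
    """Majority voting over k samples, as in the agg8/agg64 variants."""
    answers = [to_int(extract_boxed_answer(o)) for o in outputs]
    answers = [a for a in answers if a is not None]
    if not answers:
        return False
    return Counter(answers).most_common(1)[0][0] == target
```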

Environment Variables

  • PROCESSOR=gpt-4o-mini: Use GPT-4o-mini for answer extraction
  • PROMPTSTEP: Add a thinking-steps prompt
  • PROMPTTOKEN: Add a thinking-tokens prompt
  • PROMPTLONG: Add a long-thinking prompt
  • PROMPTSHORT: Add a short-thinking prompt
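To illustrate, such variables are typically read once when the task module loads. The branching and prompt wording below are assumptions for illustration, not the repo's actual strings.

```python
import os

# Sketch of consuming the variables above; the semantics are assumptions.
PROCESSOR = os.getenv("PROCESSOR")  # e.g. "gpt-4o-mini"
use_llm_extraction = PROCESSOR == "gpt-4o-mini"

prompt_suffix = ""
if os.getenv("PROMPTSTEP"):
    prompt_suffix = "Think step by step."  # illustrative wording
elif os.getenv("PROMPTLONG"):
    prompt_suffix = "Think about this problem at length before answering."  # illustrative
```

In practice these would be set in the shell when invoking the harness, e.g. prefixing the run command with PROCESSOR=gpt-4o-mini.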

Checklist

  • Is in Eval-harness v1.0?
  • Has been checked for regression from v1.0?
  • Has been checked for equivalence with original paper methodology?
  • "Main" checked variant clearly denoted?