AIME

Paper

The American Invitational Mathematics Examination (AIME) is a selective and prestigious 15-question 3-hour test given to high school students who qualify based on their AMC 10 or AMC 12 scores. All problems have integer answers between 0 and 999 inclusive. Questions increase in difficulty as the exam progresses.

The AIME dataset evaluates mathematical problem-solving capabilities on competition-level mathematics problems.

Homepage: https://huggingface.co/datasets/simplescaling/aime_nofigures

Dataset

This implementation includes both:

  • aime_nofigures: AIME problems without figures/diagrams
  • aime_figures: AIME problems with figures/diagrams

The dataset uses problems from AIME competitions, formatted for language model evaluation.
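For quick inspection, the dataset can be loaded with the Hugging Face `datasets` library. The snippet below is a generic sketch: the split and column names are whatever the hub repo defines, so check them there rather than relying on the assumptions in the comments.

```python
# Minimal sketch: loading the dataset above with Hugging Face `datasets`.
from datasets import load_dataset

ds = load_dataset("simplescaling/aime_nofigures")
print(ds)              # lists the available splits
print(ds["train"][0])  # inspect one record (a "train" split is an assumption)
```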

Groups and Tasks

Groups

  • math_word_problems

Tasks

  • aime_nofigures: AIME problems without figures
  • aime_figures: AIME problems with figures
  • aime24_nofigures: AIME 2024 problems without figures
  • aime24_figures: AIME 2024 problems with figures
  • aime25_nofigures: AIME 2025 problems without figures
  • Aggregated variants (agg8, agg64) that sample each problem multiple times for coverage and majority-voting metrics (a usage sketch for running these tasks follows this list)
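As a rough usage illustration, any of the task names above can be passed to the harness's Python entry point. This sketch assumes a recent lm-evaluation-harness API (`simple_evaluate` with the `hf` backend); the model name is a placeholder, and the interface pinned by this repo may differ.

```python
# Hypothetical sketch: running one AIME task through lm-evaluation-harness.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    tasks=["aime24_nofigures"],  # any task name from the list above
)
print(results["results"]["aime24_nofigures"])
```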

Evaluation

The evaluation checks whether the model's output matches the correct integer answer (0-999). The implementation includes the following (a sketch of the extraction and voting logic follows this list):

  • Answer extraction from model outputs
  • Support for boxed answers (e.g., \boxed{123})
  • Optional GPT-4o-mini based answer extraction for complex formats
  • Coverage and majority voting metrics for aggregated tasks
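Below is a minimal sketch of what the regex path of this pipeline might look like (the GPT-4o-mini extraction path is omitted). The function names are illustrative and do not mirror the repo's actual utilities.

```python
import re
from collections import Counter

def extract_boxed_answer(output: str) -> str | None:
    """Take the last \\boxed{...} in the output (simple, non-nested regex)."""
    matches = re.findall(r"\\boxed\{([^{}]+)\}", output)
    return matches[-1].strip() if matches else None

def to_int(answer: str | None) -> int | None:
    """Parse an extracted answer as an AIME integer in [0, 999]."""
    try:
        value = int(answer)
    except (TypeError, ValueError):
        return None
    return value if 0 <= value <= 999 else None

def coverage(outputs: list[str], target: int) -> bool:
    """Coverage (pass@k): any of the k sampled outputs is correct."""
    return any(to_int(extract_boxed_answer(o)) == target for o in outputs)

def majority_vote(outputs: list[str], target: int) -> bool:
    """Majority voting over k samples, as in the agg8/agg64 variants."""
    answers = [to_int(extract_boxed_answer(o)) for o in outputs]
    answers = [a for a in answers if a is not None]
    if not answers:
        return False
    return Counter(answers).most_common(1)[0][0] == target
```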

Environment Variables

  • PROCESSOR=gpt-4o-mini: Use GPT-4o-mini for answer extraction
  • PROMPTSTEP: Add a thinking-steps prompt
  • PROMPTTOKEN: Add a thinking-tokens prompt
  • PROMPTLONG: Add a long-thinking prompt
  • PROMPTSHORT: Add a short-thinking prompt
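To illustrate, such variables are typically read once when the task module loads. The branching and prompt wording below are assumptions for illustration, not the repo's actual strings.

```python
import os

# Sketch of consuming the variables above; the semantics are assumptions.
PROCESSOR = os.getenv("PROCESSOR")  # e.g. "gpt-4o-mini"
use_llm_extraction = PROCESSOR == "gpt-4o-mini"

prompt_suffix = ""
if os.getenv("PROMPTSTEP"):
    prompt_suffix = "Think step by step."  # illustrative wording
elif os.getenv("PROMPTLONG"):
    prompt_suffix = "Think about this problem at length before answering."  # illustrative
```

In practice these would be set in the shell when invoking the harness, e.g. prefixing the run command with PROCESSOR=gpt-4o-mini.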

Checklist

  • Is in Eval-harness v1.0?
  • Has been checked for regression from v1.0?
  • Has been checked for equivalence with original paper methodology?
  • "Main" checked variant clearly denoted?