# AIME
## Paper
The American Invitational Mathematics Examination (AIME) is a selective and prestigious 15-question, 3-hour test given to high school students who qualify based on their AMC 10 or AMC 12 scores. All problems have integer answers between 0 and 999 inclusive, and questions increase in difficulty as the exam progresses.
The AIME dataset evaluates mathematical problem-solving on these competition-level problems.
Homepage: https://huggingface.co/datasets/simplescaling/aime_nofigures
## Dataset
This implementation includes both:
- `aime_nofigures`: AIME problems without figures/diagrams
- `aime_figures`: AIME problems with figures/diagrams
The dataset uses problems from AIME competitions, formatted for language model evaluation.
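For quick inspection outside the harness, the data can be loaded directly from the Hugging Face Hub. A minimal sketch (split and column names are not documented here, so the code prints them rather than assuming them):
```python
from datasets import load_dataset

# Download all splits of the figure-free variant from the Hugging Face Hub.
dsd = load_dataset("simplescaling/aime_nofigures")
print(dsd)                 # shows split names and row counts

split = next(iter(dsd.values()))
print(split.column_names)  # inspect the available fields
print(split[0])            # one problem record
```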
## Groups and Tasks
### Groups
- `math_word_problems`
### Tasks
- `aime_nofigures`: AIME problems without figures
- `aime_figures`: AIME problems with figures
- `aime24_nofigures`: AIME 2024 problems without figures
- `aime24_figures`: AIME 2024 problems with figures
- `aime25_nofigures`: AIME 2025 problems without figures
- Aggregated variants (`agg8`, `agg64`) that sample each problem multiple times for coverage and majority-voting metrics; see the example invocation below
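A typical run through the lm-evaluation-harness CLI might look like the following; the model name is a placeholder, and `--model_args` should be adapted to your setup:
```bash
# Evaluate a Hugging Face model on the 2024 figure-free task.
# "my-org/my-model" is a placeholder; substitute any causal LM.
lm_eval --model hf \
    --model_args pretrained=my-org/my-model \
    --tasks aime24_nofigures \
    --batch_size auto
```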
## Evaluation
The evaluation checks whether the model's output matches the correct integer answer (0-999). The implementation includes:
- Answer extraction from model outputs
- Support for boxed answers (e.g., `\boxed{123}`; see the sketch after this list)
- Optional GPT-4o-mini-based answer extraction for complex formats
- Coverage and majority-voting metrics for aggregated tasks
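As an illustration of the boxed-answer path, a minimal extraction function could look like this. This is a sketch, not the harness's actual extraction code, and it assumes a simple non-nested `\boxed{...}`:
```python
import re

def extract_boxed_answer(output: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model output.

    Sketch only: assumes the braces are not nested; the harness's
    real extraction handles more formats and can fall back to an LLM.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    return matches[-1].strip() if matches else None

assert extract_boxed_answer(r"Thus the answer is \boxed{123}.") == "123"
```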
## Environment Variables
- `PROCESSOR=gpt-4o-mini`: Use GPT-4o-mini for answer extraction (see the example below)
- `PROMPTSTEP`: Add a thinking-steps prompt
- `PROMPTTOKEN`: Add a thinking-tokens prompt
- `PROMPTLONG`: Add a long-thinking prompt
- `PROMPTSHORT`: Add a short-thinking prompt
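These are set on the command line at invocation time, for example (model name is a placeholder, and the GPT-4o-mini calls presumably require OpenAI API credentials in the environment):
```bash
# Use GPT-4o-mini to extract answers from free-form outputs.
PROCESSOR=gpt-4o-mini lm_eval --model hf \
    --model_args pretrained=my-org/my-model \
    --tasks aime24_nofigures
```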
## Checklist
- [ ] Is in Eval-harness v1.0?
- [ ] Has been checked for regression from v1.0?
- [ ] Has been checked for equivalence with original paper methodology?
- [ ] "Main" checked variant clearly denoted? |