
FrontierMath


Date: 2024-11-07

Name: FrontierMath

Domain: Mathematics

Focus: Challenging advanced mathematical reasoning

Task Types: Problem solving

Metrics: Accuracy

Models: unknown

AI/ML Motif: Reasoning & Generalization

Resources

Benchmark: Visit


Citation

  • Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. 2024. URL: https://arxiv.org/abs/2411.04872, arXiv:2411.04872.
@misc{glazer2024frontiermathbenchmarkevaluatingadvanced,
  archiveprefix = {arXiv},
  author        = {Elliot Glazer and Ege Erdil and Tamay Besiroglu and Diego Chicharro and Evan Chen and Alex Gunning and Caroline Falkman Olsson and Jean-Stanislas Denain and Anson Ho and Emily de Oliveira Santos and Olli J\"{a}rviniemi and Matthew Barnett and Robert Sandler and Matej Vrzala and Jaime Sevilla and Qiuyu Ren and Elizabeth Pratt and Lionel Levine and Grant Barkley and Natalie Stewart and Bogdan Grechuk and Tetiana Grechuk and Shreepranav Varma Enugandla and Mark Wildon},
  eprint        = {2411.04872},
  primaryclass  = {cs.AI},
  title         = {FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI},
  url           = {https://arxiv.org/abs/2411.04872},
  year          = {2024}
}

Ratings

| Category | Rating | Reason |
|---|---|---|
| Software | 0.00 | No publicly available code to run the benchmark. |
| Specification | 3.00 | Well-specified process for asking questions and receiving answers. No software or hardware constraints. |
| Dataset | 0.00 | Only sample problems have been released; the full dataset is not publicly available. |
| Metrics | 5.00 | All questions in the dataset have a correct answer. |
| Reference Solution | 2.00 | Reports results of leading models on the benchmark, but none are trainable or list constraints. |
| Documentation | 5.00 | All necessary information is in the paper and on the website. |
Average rating: 2.50/5
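
The average appears to be the unweighted mean of the six category ratings listed above. Assuming that, the figure can be checked with a minimal sketch; the names `ratings` and `average_rating` are illustrative and not part of the benchmark or this site:

```python
# Category ratings copied from the Ratings section (0-5 scale).
ratings = {
    "Software": 0.00,
    "Specification": 3.00,
    "Dataset": 0.00,
    "Metrics": 5.00,
    "Reference Solution": 2.00,
    "Documentation": 5.00,
}


def average_rating(scores):
    """Unweighted mean of the category ratings (assumed aggregation)."""
    return sum(scores.values()) / len(scores)


print(f"Average rating: {average_rating(ratings):.2f}/5")
```

Under this assumption the six ratings sum to 15, and 15 / 6 matches the stated 2.50/5.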

Radar plot

[Figure: FrontierMath radar plot of the category ratings]
