FrontierMath

Date: 2024-11-07

Name: FrontierMath

Domain: Mathematics

Focus: Challenging advanced mathematical reasoning

Task Types: Problem solving

Metrics: Accuracy

Models: unknown

AI/ML Motif: Reasoning & Generalization

Resources

Benchmark: Visit

Keywords

symbolic reasoning, number theory, algebraic geometry, category theory

Citation

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. 2024. URL: https://arxiv.org/abs/2411.04872, arXiv:2411.04872.

@misc{glazer2024frontiermathbenchmarkevaluatingadvanced,
  archiveprefix = {arXiv},
  author        = {Elliot Glazer and Ege Erdil and Tamay Besiroglu and Diego Chicharro and Evan Chen and Alex Gunning and Caroline Falkman Olsson and Jean-Stanislas Denain and Anson Ho and Emily de Oliveira Santos and Olli J\"{a}rviniemi and Matthew Barnett and Robert Sandler and Matej Vrzala and Jaime Sevilla and Qiuyu Ren and Elizabeth Pratt and Lionel Levine and Grant Barkley and Natalie Stewart and Bogdan Grechuk and Tetiana Grechuk and Shreepranav Varma Enugandla and Mark Wildon},
  eprint        = {2411.04872},
  primaryclass  = {cs.AI},
  title         = {FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI},
  url           = {https://arxiv.org/abs/2411.04872},
  year          = {2024}
}

Ratings

| Category | Rating | Reason |
|---|---|---|
| Software | 0.00 | No publicly available code to run the benchmark. |
| Specification | 3.00 | Well-specified process for asking questions and receiving answers. No software or hardware constraints. |
| Dataset | 0.00 | Only sample problems exist; the dataset is not publicly available. |
| Metrics | 5.00 | All questions in the dataset have a correct answer. |
| Reference Solution | 2.00 | Displays results of leading models on the benchmark, but none are trainable or list constraints. |
| Documentation | 5.00 | All necessary information is in the paper and on the website. |

Average rating: 2.50/5
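
The entry does not say how the average is derived. As a quick sanity check, the sketch below assumes it is the unweighted arithmetic mean of the six category scores listed under Ratings; the `ratings` dictionary and the `average_rating` helper are illustrative names, not part of any benchmark tooling.

```python
# Category scores copied from the Ratings section of this entry.
ratings = {
    "Software": 0.00,
    "Specification": 3.00,
    "Dataset": 0.00,
    "Metrics": 5.00,
    "Reference Solution": 2.00,
    "Documentation": 5.00,
}


def average_rating(scores):
    """Unweighted mean of the category scores (an assumption about the method)."""
    return sum(scores.values()) / len(scores)


print(f"Average rating: {average_rating(ratings):.2f}/5")  # prints "Average rating: 2.50/5"
```

Under that assumption the six scores sum to 15, and 15 / 6 = 2.50, which matches the figure reported above.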

Radar plot

[Figure: FrontierMath radar plot of the category ratings]