PRM800K


Date: 2023-05-30

Name: PRM800K

Domain: Mathematics

Focus: Math reasoning generalization

Task Types: Problem solving

Metrics: Accuracy

Models: GPT-4

AI/ML Motif: Reasoning & Generalization

Resources

Benchmark: Visit

Datasets: PRM800K: A Process Supervision Dataset

Results: Let's Verify Step by Step

Keywords

calculus, algebra, number theory, geometry

Citation

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's Verify Step by Step. arXiv preprint arXiv:2305.20050, 2023. doi:10.48550/arXiv.2305.20050.

@article{lightman2023lets,
      title={Let's Verify Step by Step}, 
      author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
      journal={arXiv preprint arXiv:2305.20050},
      year={2023},
      eprint={2305.20050},
      archivePrefix={arXiv},
      doi={10.48550/arXiv.2305.20050}
}

Ratings

Software: 3.00

Code for evaluation and grading is provided in the PRM800K repo and documentation is present, but no environment details, baseline model, or training code is given.

Specification: 4.00

The task is well specified: format, inputs, and outputs are described. No system constraints are provided.

Dataset: 5.00

The dataset follows all FAIR principles. Train/test splits are available in the PRM800K repo.

Metrics: 4.00

Correctness is used as the primary metric, with grading guidelines provided.

Reference Solution: 2.00

A reference solution is mentioned in the "Let's Verify Step by Step" paper, but the model is not open-sourced.

Documentation: 5.00

Documentation is present in the PRM800K repo and the "Let's Verify Step by Step" paper.

Average rating: 3.83/5
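
The average appears to be the unweighted mean of the six category ratings listed above. A minimal sketch that reproduces it, assuming equal weights (the names `ratings` and `average_rating` are illustrative and not part of the benchmark):

```python
# Category ratings copied from the Ratings section of this entry.
# Assumption: every category carries equal weight in the average.
ratings = {
    "Software": 3.00,
    "Specification": 4.00,
    "Dataset": 5.00,
    "Metrics": 4.00,
    "Reference Solution": 2.00,
    "Documentation": 5.00,
}


def average_rating(scores):
    """Return the unweighted mean of the rating values."""
    return sum(scores.values()) / len(scores)


print(f"Average rating: {average_rating(ratings):.2f}/5")
```

The six ratings sum to 23, and 23 / 6 rounds to 3.83, which matches the reported figure.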

Radar plot

[Figure: PRM800K radar plot of the category ratings]
