PRM800K

Date: 2023-05-30

Name: PRM800K

Domain: Mathematics

Focus: Math reasoning generalization

Task Types: Problem solving

Metrics: Accuracy

Models: GPT-4

AI/ML Motif: Reasoning & Generalization

Keywords

Citation

  • Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023. doi:10.48550/arXiv.2305.20050.
@article{lightman2023lets,
      title={Let's Verify Step by Step}, 
      author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
      journal={arXiv preprint arXiv:2305.20050},
      year={2023},
      eprint={arXiv:2305.20050},
      doi={10.48550/arXiv.2305.20050}
}

Ratings

Software: 3.00
Code for evaluation and grading is provided in the PRM800K repo, and documentation is present, but no environment details, baseline model, or training code is given.

Specification: 4.00
The task is well specified: format, inputs, and outputs are described. No system constraints are provided.

Dataset: 5.00
The dataset follows all FAIR principles. Train/test splits are available in the PRM800K repo.

Metrics: 4.00
Correctness is used as the primary metric, with grading guidelines provided.

Reference Solution: 2.00
A reference solution is mentioned in the "Let's Verify Step by Step" paper, but the model is not open-sourced.

Documentation: 5.00
Documentation is present in the PRM800K repo and the "Let's Verify Step by Step" paper.

Average rating: 3.83/5
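
The entry does not say how the average is computed. As a minimal sketch, assuming it is the unweighted arithmetic mean of the six category ratings listed above, the figure can be reproduced as follows. The names `RATINGS` and `mean_rating` are illustrative only and do not come from the benchmark catalog.

```python
# Category scores copied from the Ratings section (each out of 5).
RATINGS = {
    "Software": 3.00,
    "Specification": 4.00,
    "Dataset": 5.00,
    "Metrics": 4.00,
    "Reference Solution": 2.00,
    "Documentation": 5.00,
}


def mean_rating(ratings):
    """Unweighted arithmetic mean of the scores (an assumed aggregation rule)."""
    return sum(ratings.values()) / len(ratings)


# 23 / 6 = 3.8333..., which rounds to the 3.83 reported in the entry.
print(f"Average rating: {mean_rating(RATINGS):.2f}/5")
```

Under this assumption the lowest score, Reference Solution at 2.00, pulls the average down by the same amount as any other category would, since no category carries extra weight.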

[Figure: radar plot of the PRM800K ratings]