
GPQA Diamond


Date: 2023-11-20

Name: GPQA Diamond

Domain: Biology & Medicine, Chemistry, High Energy Physics

Focus: Graduate-level scientific reasoning

Task Types: Multiple choice, Multi-step QA

Metrics: Accuracy

Models: o1, DeepSeek-R1

AI/ML Motif: Reasoning & Generalization

Resources

Benchmark: Visit

Keywords

Citation

  • David Rein, Betty Li Hou, and Asa Cooper Stickland. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. 2023. URL: https://arxiv.org/abs/2311.12022.
@misc{rein2023gpqagraduatelevelgoogleproofqa,
  title={GPQA: A Graduate-Level Google-Proof Q\&A Benchmark},
  author={Rein, David and Hou, Betty Li and Stickland, Asa Cooper},
  year={2023},
  url={https://arxiv.org/abs/2311.12022}
}

Ratings

Category | Rating | Notes
Software | 5.00 | Python version and requirements are specified on the GitHub site.
Specification | 2.00 | No system constraints or I/O format are specified.
Dataset | 5.00 | The dataset is easily accessible and comes with predefined splits, as described in the paper (see the loading sketch below).
Metrics | 5.00 | Each question has a single correct answer, so accuracy directly reflects the tested model's performance.
Reference Solution | 1.00 | Baseline models such as GPT-3.5 were compared, but they are not open and do not provide requirements.
Documentation | 5.00 | All information is listed in the associated paper.

Average rating: 3.83/5
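
The Dataset and Metrics notes above amount to a very small evaluation loop: draw each Diamond question, present its four options, and count exact matches against the correct answer. Below is a minimal sketch of that loop. It assumes the dataset is mirrored on the Hugging Face Hub under the ID Idavidrein/gpqa with a gpqa_diamond configuration, and that records use the column names from the released CSV (Question, Correct Answer, Incorrect Answer 1-3); these identifiers are not confirmed by this entry and should be treated as assumptions.

import random

from datasets import load_dataset  # assumes the Hugging Face `datasets` package is installed

def diamond_accuracy(predict, seed: int = 0) -> float:
    """Score a predict(question, options) callable on GPQA Diamond.

    `predict` receives the question text and a shuffled list of four
    options and must return the option it believes is correct.
    """
    rng = random.Random(seed)
    # Assumed dataset ID and config; the official release is a CSV in the GitHub repo.
    rows = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

    correct = 0
    for row in rows:
        options = [
            row["Correct Answer"],
            row["Incorrect Answer 1"],
            row["Incorrect Answer 2"],
            row["Incorrect Answer 3"],
        ]
        rng.shuffle(options)  # shuffle to avoid positional bias
        choice = predict(row["Question"], options)
        correct += int(choice == row["Correct Answer"])
    return correct / len(rows)

# Usage: a uniform random guesser should land near 0.25 accuracy.
if __name__ == "__main__":
    print(diamond_accuracy(lambda question, options: random.choice(options)))

Fixing the shuffle seed keeps the option ordering identical across runs, so scores from different models remain directly comparable.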

Radar plot

[Radar chart summarizing the GPQA Diamond ratings above]
