
GPQA Diamond


Date: 2023-11-20

Name: GPQA Diamond

Domain: Biology & Medicine, Chemistry, High Energy Physics

Focus: Graduate-level scientific reasoning

Task Types: Multiple choice, Multi-step QA

Metrics: Accuracy

Models: o1, DeepSeek-R1

AI/ML Motif: Reasoning & Generalization

Resources

Benchmark: Visit

Keywords

Citation

  • David Rein, Betty Li Hou, and Asa Cooper Stickland. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. 2023. URL: https://arxiv.org/abs/2311.12022.
@misc{rein2023gpqagraduatelevelgoogleproofqa,
  title={GPQA: A Graduate-Level Google-Proof Q\&A Benchmark},
  author={Rein, David and Hou, Betty Li and Stickland, Asa Cooper},
  year={2023},
  url={https://arxiv.org/abs/2311.12022}
}

Ratings

Category | Rating | Notes
Software | 5.00 | Python version and requirements are specified on the GitHub site.
Specification | 2.00 | No system constraints or I/O format are specified.
Dataset | 5.00 | The dataset is easily accessible and comes with predefined splits, as described in the paper (see the loading sketch below).
Metrics | 5.00 | Each question has a single correct answer, so accuracy directly reflects the tested model's performance.
Reference Solution | 1.00 | Baseline models such as GPT-3.5 were compared, but they are not open and do not provide requirements.
Documentation | 5.00 | All information is listed in the associated paper.

Average rating: 3.83/5
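
The Dataset and Metrics notes above amount to a very small evaluation loop: draw each Diamond question, present its four options, and count exact matches against the correct answer. Below is a minimal sketch of that loop. It assumes the dataset is mirrored on the Hugging Face Hub under the ID Idavidrein/gpqa with a gpqa_diamond configuration, and that records use the column names from the released CSV (Question, Correct Answer, Incorrect Answer 1-3); these identifiers are not confirmed by this entry and should be treated as assumptions.

import random

from datasets import load_dataset  # assumes the Hugging Face `datasets` package is installed

def diamond_accuracy(predict, seed: int = 0) -> float:
    """Score a predict(question, options) callable on GPQA Diamond.

    `predict` receives the question text and a shuffled list of four
    options and must return the option it believes is correct.
    """
    rng = random.Random(seed)
    # Assumed dataset ID and config; the official release is a CSV in the GitHub repo.
    rows = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")

    correct = 0
    for row in rows:
        options = [
            row["Correct Answer"],
            row["Incorrect Answer 1"],
            row["Incorrect Answer 2"],
            row["Incorrect Answer 3"],
        ]
        rng.shuffle(options)  # shuffle to avoid positional bias
        choice = predict(row["Question"], options)
        correct += int(choice == row["Correct Answer"])
    return correct / len(rows)

# Usage: a uniform random guesser should land near 0.25 accuracy.
if __name__ == "__main__":
    print(diamond_accuracy(lambda question, options: random.choice(options)))

Fixing the shuffle seed keeps the option ordering identical across runs, so scores from different models remain directly comparable.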

Radar plot

[Radar chart summarizing the GPQA Diamond ratings above]
