GPQA: A Graduate-Level Google-Proof Question and Answer Benchmark

← Back to all benchmarks

Date: 2023-11-20

Name: GPQA: A Graduate-Level Google-Proof Question and Answer Benchmark

Domain: Biology & Medicine, High Energy Physics, Chemistry

Focus: Graduate-level, expert-validated multiple-choice questions hard even with web access

Task Types: Multiple choice

Metrics: Accuracy

Models: GPT-4 baseline

AI/ML Motif: Reasoning & Generalization

Resources

Benchmark: Visit

Datasets: GPQA dataset

Results: Gemini LLM Deep Research

Keywords

Google-proof multiple-choice expert reasoning science QA

Citation

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: a graduate-level google-proof q and a benchmark. 2023. URL: https://arxiv.org/abs/2311.12022, arXiv:2311.12022.

@misc{rein2023gpqagraduatelevelgoogleproofqa2,
  archiveprefix = {arXiv},
  author        = {David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
  eprint        = {2311.12022},
  primaryclass  = {cs.AI},
  title         = {GPQA: A Graduate-Level Google-Proof Q and A Benchmark},
  url           = {https://arxiv.org/abs/2311.12022},
  year          = {2023}
}

Ratings

CategoryRating

Software

3.00

Dataset and benchmark materials are publicly available via HuggingFace and GitHub, but no integrated runnable code or software framework is provided.

Specification

5.00

Task is clearly defined as a multiple-choice benchmark requiring expert-level scientific reasoning. Input/output formats and evaluation criteria are well described.

Dataset

5.00

The GPQA dataset is publicly released, well curated, with metadata and clearly documented splits.

Metrics

5.00

Accuracy is the primary metric and is clearly defined and appropriate for multiple-choice QA.

Reference Solution

1.00

No baseline implementations or starter code are linked or provided for reproduction.

Documentation

3.00

Documentation includes dataset description and benchmark instructions, but lacks detailed usage tutorials or pipelines.

Average rating: 3.67/5

Radar plot

$GPQA: A Graduate-Level Google-Proof Question and Answer Benchmark radar$

Edit: edit this entry