MMLU (Massive Multitask Language Understanding)

Date: 2020-09-07

Name: MMLU (Massive Multitask Language Understanding)

Domain: Computational Science & AI

Focus: Academic knowledge and reasoning across 57 subjects

Task Types: Multiple choice

Metrics: Accuracy (see the scoring sketch after these fields)

Models: GPT-4o, Gemini 1.5 Pro, o1, DeepSeek-R1

AI/ML Motif: Reasoning & Generalization
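
The Task Types and Metrics fields above are simple by design: every item is a four-option multiple-choice question, and the sole metric is accuracy. The following minimal Python sketch shows that scoring; the Question container and the toy items are illustrative assumptions for the example, not the benchmark's own data loader.

from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    choices: list[str]  # four options, rendered as A-D
    answer: str         # gold label: "A", "B", "C", or "D"

def accuracy(questions: list[Question], predictions: list[str]) -> float:
    # Accuracy is the fraction of items whose predicted letter
    # matches the gold label.
    correct = sum(pred == q.answer for q, pred in zip(questions, predictions))
    return correct / len(questions)

# Toy usage: one correct and one incorrect prediction -> 0.5.
items = [
    Question("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
    Question("The capital of France is?", ["Berlin", "Madrid", "Paris", "Rome"], "C"),
]
print(accuracy(items, ["B", "A"]))  # 0.5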

Citation

  • Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021. URL: https://arxiv.org/abs/2009.03300.
@misc{hendrycks2021measuring,
  title={Measuring Massive Multitask Language Understanding},
  author={Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob},
  year={2021},
  eprint={2009.03300},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2009.03300}
}

Ratings

  • Software: 2.00. Some code is available on GitHub to reproduce results via the OpenAI API, but it is not well documented (a query sketch follows this list).
  • Specification: 4.00. No system constraints.
  • Dataset: 5.00. Meets all FAIR principles and is properly versioned.
  • Metrics: 5.00. Fully defined; represents a solution's performance.
  • Reference Solution: 2.00. Reference models are available (e.g., GPT-3), but they are not trainable or publicly documented.
  • Documentation: 5.00. Well explained in the provided paper.
Average rating: 3.83/5
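
As the Software rating notes, the released evaluation code queries models through the OpenAI API. The sketch below shows what such a query can look like with the current openai Python SDK; the model name, prompt template, and single-letter decoding are assumptions for illustration, not the benchmark's official harness (the original paper evaluated GPT-3 with few-shot prompts).

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(question: str, choices: list[str]) -> str:
    # Render the item as an A-D multiple-choice prompt and ask the
    # model to reply with a single answer letter.
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    prompt = f"{question}\n{options}\nAnswer with a single letter (A, B, C, or D)."
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; substitute as needed
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(ask("The capital of France is?", ["Berlin", "Madrid", "Paris", "Rome"]))  # expect "C"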

Radar plot

[Radar plot of the six category ratings above]