Index of Benchmarks
MMLU (Massive Multitask Language Understanding)
Computational Science & AI • Accuracy
Avg rating: 3.83/5
ClimateLearn - Weather Forecasting
Climate & Earth Science • RMSE, Anomaly correlation
Avg rating: 5.00/5
ClimateLearn - Downscaling
Climate & Earth Science • RMSE, Anomaly correlation
Avg rating: 5.00/5
ClimateLearn - Climate Projection
Climate & Earth Science • RMSE, Anomaly correlation
Avg rating: 5.00/5
CFDBench (Fluid Dynamics)
Mathematics • L2 error, MAE
Avg rating: 3.33/5
SatImgNet
Climate & Earth Science • Accuracy
Avg rating: 3.83/5
Quantum Computing Benchmarks (QML)
Computational Science & AI • Fidelity, Success probability
Avg rating: 3.17/5
OCP (Open Catalyst Project)
Chemistry, Materials Science • MAE (energy), MAE (force)
Avg rating: 4.17/5
MedQA
Biology & Medicine • Accuracy
Avg rating: 3.50/5
Materials Project
Materials Science • MAE, R^2
Avg rating: 1.92/5
GPQA Diamond
Biology & Medicine, Chemistry, High Energy Physics • Accuracy
Avg rating: 3.83/5
ARC-Challenge (AI2 Reasoning Challenge)
Computational Science & AI • Accuracy
Avg rating: 4.83/5
FrontierMath
Mathematics • Accuracy
Avg rating: 2.50/5
SciCode
Computational Science & AI • Solve rate (%)
Avg rating: 4.50/5
AIME (American Invitational Mathematics Examination)
Mathematics • Accuracy
Avg rating: 2.33/5
PRM800K
Mathematics • Accuracy
Avg rating: 3.83/5
CURIE (Scientific Long-Context Understanding, Reasoning and Information Extraction)
Materials Science, High Energy Physics, Biology & Medicine, Chemistry, Climate & Earth Science • Accuracy
Avg rating: 3.33/5
FEABench (Finite Element Analysis Benchmark): Evaluating Language Models on Multiphysics Reasoning Ability
Mathematics • Solve time, Error norm
Avg rating: 3.83/5
SPIQA (Scientific Paper Image Question Answering)
Computational Science & AI • Accuracy, F1 score
Avg rating: 3.67/5
BaisBench (Biological AI Scientist Benchmark) - Question Answering
Biology & Medicine • Annotation accuracy, QA accuracy
Avg rating: 4.00/5
BaisBench (Biological AI Scientist Benchmark) - Cell Type Annotation
Biology & Medicine • Annotation accuracy, QA accuracy
Avg rating: 4.00/5
MOLGEN
Chemistry • Validity%, Novelty%, QED, Docking score, penalized logP
Avg rating: 4.83/5
Open Graph Benchmark (OGB) - Biology
Biology & Medicine • Accuracy, ROC-AUC
Avg rating: 4.83/5
Jet Classification
High Energy Physics • Accuracy, AUC
Avg rating: 4.17/5
Irregular Sensor Data Compression
High Energy Physics • MSE, Compression ratio
Avg rating: 4.17/5
Beam Control
High Energy Physics • Stability, Control loss
Avg rating: 3.00/5
Ultrafast jet classification at the HL-LHC
High Energy Physics • Accuracy, Latency, Resource utilization
Avg rating: 3.17/5
Quench detection
High Energy Physics • ROC-AUC, Detection latency
Avg rating: 2.17/5
Intelligent experiments through real-time AI
High Energy Physics • Accuracy (charm and beauty detection), Latency (µs), Resource utilization (LUT/FF/BRAM/DSP)
Avg rating: 3.00/5
Neural Architecture Codesign for Fast Physics Applications
High Energy Physics • Accuracy, Latency, Resource utilization
Avg rating: 3.83/5
Smart Pixels for LHC
High Energy Physics • Data rejection rate, Power per pixel
Avg rating: 3.33/5
HEDM (BraggNN)
Materials Science • Localization accuracy, Inference time
Avg rating: 3.17/5
4D-STEM
Materials Science • Classification accuracy, Throughput
Avg rating: 3.17/5
In-Situ High-Speed Computer Vision
High Energy Physics • Accuracy, FPS
Avg rating: 1.50/5
MLPerf HPC - CosmoFlow
High Energy Physics • Training time, Accuracy, GPU utilization
Avg rating: 4.17/5
MLPerf HPC - DeepCAM
Climate & Earth Science • Training time, Accuracy, GPU utilization
Avg rating: 4.17/5
MLPerf HPC - Open Catalyst Project DimeNet++
Chemistry • Training time, Accuracy, GPU utilization
Avg rating: 4.17/5
MLPerf HPC - OpenFold
Biology & Medicine • Training time, Accuracy, GPU utilization
Avg rating: 4.17/5
MLCommons Science - CloudMask
Climate & Earth Science • MAE, Accuracy, Speedup vs simulation
Avg rating: 5.00/5
MLCommons Science - Earthquake
Climate & Earth Science • MAE, Accuracy, Speedup vs simulation
Avg rating: 5.00/5
MLCommons Science - CANDLE UNO
Biology & Medicine • MAE, Accuracy, Speedup vs simulation
Avg rating: 5.00/5
MLCommons Science - STEMDL
Materials Science • MAE, Accuracy, Speedup vs simulation
Avg rating: 5.00/5
LHC New Physics Dataset
High Energy Physics • ROC-AUC, Detection efficiency
Avg rating: 3.33/5
MLCommons Medical AI - Pancreas Segmentation (DFCI)
Biology & Medicine • ROC-AUC, Accuracy, Fairness metrics
Avg rating: 4.33/5
MLCommons Medical AI - Brain Tumor Segmentation (BraTS)
Biology & Medicine • ROC-AUC, Accuracy, Fairness metrics
Avg rating: 4.33/5
MLCommons Medical AI - Surgical Workflow Phase Recognition (SurgMLCube)
Biology & Medicine • ROC-AUC, Accuracy, Fairness metrics
Avg rating: 4.33/5
CaloChallenge 2022
High Energy Physics • Histogram similarity, Classifier AUC, Generation latency
Avg rating: 4.50/5
PDEBench
Computational Science & AI, Climate & Earth Science, Mathematics • RMSE, boundary RMSE, Fourier RMSE
Avg rating: 4.50/5
Delta Squared-DFT
Chemistry, Materials Science • Mean Absolute Error (eV), Energy ranking accuracy
Avg rating: 3.83/5
Urban Data Layer (UDL) - PM2.5 Concentration Prediction
Climate & Earth Science • Task-specific accuracy or RMSE
Avg rating: 4.50/5
Urban Data Layer (UDL) - Built-up Area Classification
Climate & Earth Science • Task-specific accuracy or RMSE
Avg rating: 4.50/5
Urban Data Layer (UDL) - Administrative Boundaries Identification
Climate & Earth Science • Task-specific accuracy or RMSE
Avg rating: 4.50/5
Urban Data Layer (UDL) - El Niño Anomaly Detection
Climate & Earth Science • Task-specific accuracy or RMSE
Avg rating: 4.50/5
LLMs for Crop Science
Climate & Earth Science • Accuracy, F1 score
Avg rating: 4.67/5
DUNE
High Energy Physics • Detection efficiency, Latency
Avg rating: 2.83/5
MassSpecGym - De novo molecule generation
Chemistry • Structure accuracy, Retrieval precision, Simulation MSE
Avg rating: 3.75/5
MassSpecGym - Molecule Retrieval
Chemistry • Structure accuracy, Retrieval precision, Simulation MSE
Avg rating: 3.75/5
MassSpecGym - Spectrum Simulation
Chemistry • Structure accuracy, Retrieval precision, Simulation MSE
Avg rating: 3.75/5
HDR ML Anomaly Challenge - Gravitational Waves
High Energy Physics • ROC-AUC, Precision/Recall
Avg rating: 4.17/5
HDR ML Anomaly Challenge - Butterfly
Biology & Medicine • Classification accuracy, F1 score
Avg rating: 3.00/5
HDR ML Anomaly Challenge - Sea Level Rise
Climate & Earth Science • ROC-AUC, Precision/Recall
Avg rating: 3.83/5
Single Qubit Readout on QICK System
Computational Science & AI • Accuracy, Latency
Avg rating: 3.50/5
GPQA: A Graduate-Level Google-Proof Question and Answer Benchmark
Biology & Medicine, High Energy Physics, Chemistry • Accuracy
Avg rating: 3.67/5
SeafloorAI
Climate & Earth Science • Segmentation pixel accuracy, QA accuracy
Avg rating: 4.33/5
SeafloorGenAI
Climate & Earth Science • Segmentation pixel accuracy, QA accuracy
Avg rating: 4.33/5
SuperCon3D - Property Prediction
Materials Science • MAE (Tc), Validity of generated structures
Avg rating: 4.17/5
SuperCon3D - Inverse Crystal Structure Generation
Materials Science • MAE (Tc), Validity of generated structures
Avg rating: 4.17/5
GeSS - Track Pileup
High Energy Physics • Accuracy, RMSE, OOD robustness delta
Avg rating: 4.33/5
GeSS - Track Signal
High Energy Physics • Accuracy, RMSE, OOD robustness delta
Avg rating: 4.33/5
GeSS - DrugOOD
Biology & Medicine • Accuracy, RMSE, OOD robustness delta
Avg rating: 4.33/5
GeSS - QMOF
Materials Science • Accuracy, RMSE, OOD robustness delta
Avg rating: 4.33/5
Vocal Call Locator (VCL)
Biology & Medicine • Localization error (cm), Recall/Precision
Avg rating: 3.83/5
SPIQA (LLM)
Computational Science & AI • Accuracy, F1 score
Avg rating: 4.42/5
The Well
Biology & Medicine, Computational Science & AI, High Energy Physics • Dataset size, Domain breadth
Avg rating: 4.00/5