SciCode

Date: 2024-07-18

Name: SciCode

Domain: Computational Science & AI

Focus: Scientific code generation and problem solving

Task Types: Coding

Metrics: Solve rate (%)

Models: Claude 3.5 Sonnet

AI/ML Motif: Generative

Resources

Benchmark: Visit

Citation

  • Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, and Hao Peng. SciCode: A Research Coding Benchmark Curated by Scientists. 2024. URL: https://arxiv.org/abs/2407.13168, arXiv:2407.13168.
@misc{tian2024scicoderesearchcodingbenchmark,
  archiveprefix = {arXiv},
  author        = {Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
  eprint        = {2407.13168},
  primaryclass  = {cs.AI},
  title         = {SciCode: A Research Coding Benchmark Curated by Scientists},
  url           = {https://arxiv.org/abs/2407.13168},
  year          = {2024}
}

Ratings

Category | Rating | Notes
Software | 5.00 | Runnable code is available in the GitHub repo.
Specification | 4.00 | Expected outputs and broad types of inputs are stated. Few details on output grading. No hardware constraints.
Dataset | 5.00 | Dataset meets all FAIR principles; test and validation splits are available (no train split).
Metrics | 4.00 | Metrics are stated; grading guidelines are provided in the repo (problems are pass/fail).
Reference Solution | 5.00 | Evaluation code is available and well documented. Baseline models include closed and open-weight models.
Documentation | 4.00 | The paper contains all needed information except for evaluation criteria.
Average rating: 4.50/5
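
The average appears to be the arithmetic mean of the six category ratings listed above. A minimal sketch that checks this, assuming equal weighting (the entry does not state this explicitly); the variable names are illustrative only:

```python
from statistics import mean

# Category ratings copied from the Ratings table in this entry.
ratings = {
    "Software": 5.00,
    "Specification": 4.00,
    "Dataset": 5.00,
    "Metrics": 4.00,
    "Reference Solution": 5.00,
    "Documentation": 4.00,
}

# Assumption: every category carries equal weight.
average_rating = mean(list(ratings.values()))
print(f"Average rating: {average_rating:.2f}/5")  # prints "Average rating: 4.50/5"
```

Under this assumption the result matches the reported 4.50/5.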

[Figure: SciCode radar plot]
