SciCode


Date: 2024-07-18

Name: SciCode

Domain: Computational Science & AI

Focus: Scientific code generation and problem solving

Task Types: Coding

Metrics: Solve rate (%)

Models: Claude 3.5 Sonnet

AI/ML Motif: Generative

Resources

Benchmark: Visit

Datasets: SciCode on Hugging Face

Results: SciCode Leaderboard

Keywords

code synthesis, scientific computing, programming benchmark

Citation

Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, and Hao Peng. SciCode: A Research Coding Benchmark Curated by Scientists. 2024. arXiv:2407.13168. URL: https://arxiv.org/abs/2407.13168.

@misc{tian2024scicoderesearchcodingbenchmark,
  archiveprefix = {arXiv},
  author        = {Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
  eprint        = {2407.13168},
  primaryclass  = {cs.AI},
  title         = {SciCode: A Research Coding Benchmark Curated by Scientists},
  url           = {https://arxiv.org/abs/2407.13168},
  year          = {2024}
}

Ratings

Category | Rating | Reason
Software | 5.00 | Code to run is available in the GitHub repo.
Specification | 4.00 | Expected outputs and broad types of inputs are stated. Few details on output grading. No hardware constraints.
Dataset | 5.00 | Dataset meets all FAIR principles; test and validation splits are available (no train split).
Metrics | 4.00 | Metrics are stated; grading guidelines are provided in the repo (problems are pass/fail).
Reference Solution | 5.00 | Evaluation code is available and well documented. Baseline models include closed and open-weight models.
Documentation | 4.00 | The paper contains all needed information except for evaluation criteria.

Average rating: 4.50/5
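As a quick arithmetic check, the average above can be reproduced from the six category ratings listed in the Ratings section. The sketch below assumes an unweighted mean, which this entry does not state explicitly; the dictionary name `ratings` is illustrative only.

```python
# Category ratings copied from the Ratings section of this entry.
ratings = {
    "Software": 5.00,
    "Specification": 4.00,
    "Dataset": 5.00,
    "Metrics": 4.00,
    "Reference Solution": 5.00,
    "Documentation": 4.00,
}

# Assumption: the average is an unweighted arithmetic mean over all categories.
average = sum(ratings.values()) / len(ratings)

print(f"Average rating: {average:.2f}/5")  # prints "Average rating: 4.50/5"
```

The six ratings sum to 27, so the unweighted mean is 27/6 = 4.50, which matches the reported value.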

Radar plot

[Figure: SciCode radar]
