BaisBench (Biological AI Scientist Benchmark) - Question Answering

Date: 2025-05-13

Name: BaisBench (Biological AI Scientist Benchmark) - Question Answering

Domain: Biology & Medicine

Focus: Omics-driven AI research tasks

Task Types: Cell type annotation, Multiple choice

Metrics: Annotation accuracy, QA accuracy

Models: LLM-based AI scientist agents

AI/ML Motif: Reasoning & Generalization
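
Both listed metrics are accuracies. As a rough illustration of the multiple-choice case only, the sketch below computes QA accuracy as the fraction of questions answered with the correct option. Exact-match scoring, the function name `qa_accuracy`, and the sample answers are assumptions made for illustration and are not taken from the benchmark's code; annotation accuracy may be scored differently.

```python
from typing import Sequence


def qa_accuracy(predicted: Sequence[str], correct: Sequence[str]) -> float:
    """Fraction of multiple-choice questions answered correctly (exact match)."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if not correct:
        raise ValueError("at least one question is required")
    # Compare option labels case-insensitively, ignoring surrounding whitespace.
    hits = sum(
        p.strip().upper() == c.strip().upper()
        for p, c in zip(predicted, correct)
    )
    return hits / len(correct)


# Hypothetical answers for four questions, not real benchmark data.
example_score = qa_accuracy(["A", "c", "B", "D"], ["A", "C", "D", "D"])
print(example_score)
```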

Resources

Benchmark: Visit
Datasets: GitHub

Citation

  • Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu Li, Xiaobo Guo, Baoqi Yu, Lei Wei, and Xuegong Zhang. Benchmarking AI scientists in omics data-driven biological research. 2025. URL: https://arxiv.org/abs/2505.08341, arXiv:2505.08341.
@misc{luo2025benchmarkingaiscientistsomics,
  archiveprefix = {arXiv},
  author        = {Erpai Luo and Jinmeng Jia and Yifan Xiong and Xiangyu Li and Xiaobo Guo and Baoqi Yu and Lei Wei and Xuegong Zhang},
  eprint        = {2505.08341},
  primaryclass  = {cs.AI},
  title         = {Benchmarking AI scientists in omics data-driven biological research},
  url           = {https://arxiv.org/abs/2505.08341},
  year          = {2025}
}

Ratings

Software: 5.00. Instructions for environment setup are available.
Specification: 4.00. The tasks (cell type annotation and biological QA) are clearly defined and the input/output formats are well described; system constraints are not quantified.
Dataset: 5.00. Uses public scRNA-seq datasets linked in the paper's appendix; the data are structured and accessible, though versioning and full metadata are not formalized per FAIR standards.
Metrics: 5.00. Includes precise and interpretable metrics (annotation accuracy and QA accuracy) that align directly with the task outputs and benchmarking goals.
Reference Solution: 0.00. Model evaluations and LLM agent results are discussed, but no fully packaged, runnable baseline has been confirmed.
Documentation: 5.00. The dataset and paper are accessible; Jupyter notebooks (.ipynb) for setup are available in the GitHub repository.
Average rating: 4.00/5
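
The reported average is consistent with an unweighted mean of the six category ratings. A minimal sketch of that arithmetic, assuming equal weights (the page does not state how the average is computed); the dictionary name `ratings` is illustrative:

```python
from statistics import mean

# Category ratings as listed in the Ratings section.
ratings = {
    "Software": 5.00,
    "Specification": 4.00,
    "Dataset": 5.00,
    "Metrics": 5.00,
    "Reference Solution": 0.00,
    "Documentation": 5.00,
}

# Assumption: the average is an unweighted arithmetic mean over all categories.
average_rating = mean(ratings.values())
print(f"Average rating: {average_rating:.2f}/5")
```

Under this assumption, the zero for Reference Solution lowers the mean by a full point relative to the other five categories.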

Radar plot

[Figure: radar plot for BaisBench (Biological AI Scientist Benchmark) - Question Answering]
