BaisBench (Biological AI Scientist Benchmark) - Question Answering


Date: 2025-05-13

Name: BaisBench (Biological AI Scientist Benchmark) - Question Answering

Domain: Biology & Medicine

Focus: Omics-driven AI research tasks

Task Types: Cell type annotation, Multiple choice

Metrics: Annotation accuracy, QA accuracy

Models: LLM-based AI scientist agents

AI/ML Motif: Reasoning & Generalization
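The two metrics listed above are both accuracies, so they can be illustrated with one helper. The following is a minimal sketch that assumes plain exact-match scoring after light normalization; the benchmark's own scoring rules may differ. The function name `accuracy` and all labels and answers in the example are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference.

    Assumption: exact string match after trimming whitespace and
    lowercasing. The benchmark may score answers differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical cell type annotation results (one label per cell or cluster).
predicted_cell_types = ["T cell", "B cell", "monocyte", "NK cell"]
reference_cell_types = ["T cell", "B cell", "macrophage", "NK cell"]
annotation_accuracy = accuracy(predicted_cell_types, reference_cell_types)

# Hypothetical multiple-choice answers (one option letter per question).
predicted_choices = ["A", "C", "B", "D", "A"]
reference_choices = ["A", "C", "D", "D", "B"]
qa_accuracy = accuracy(predicted_choices, reference_choices)

print(f"Annotation accuracy: {annotation_accuracy:.2f}")
print(f"QA accuracy: {qa_accuracy:.2f}")
```

Using one function for both tasks keeps the two numbers directly comparable; a task-specific scorer could be swapped in for either one without changing the calling code.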

Resources

Benchmark: Visit

Datasets: GitHub

Keywords

single-cell annotation, biological QA, autonomous discovery

Citation

Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu Li, Xiaobo Guo, Baoqi Yu, Lei Wei, and Xuegong Zhang. Benchmarking AI scientists in omics data-driven biological research. 2025. URL: https://arxiv.org/abs/2505.08341, arXiv:2505.08341.

@misc{luo2025benchmarkingaiscientistsomics,
  archiveprefix = {arXiv},
  author        = {Erpai Luo and Jinmeng Jia and Yifan Xiong and Xiangyu Li and Xiaobo Guo and Baoqi Yu and Lei Wei and Xuegong Zhang},
  eprint        = {2505.08341},
  primaryclass  = {cs.AI},
  title         = {Benchmarking AI scientists in omics data-driven biological research},
  url           = {https://arxiv.org/abs/2505.08341},
  year          = {2025}
}

Ratings

Category (rating): rationale

Software (5.00): Instructions for environment setup are available.

Specification (4.00): Tasks are clearly defined (cell type annotation and biological QA) and input/output formats are well described, but system constraints are not quantified.

Dataset (5.00): Uses public scRNA-seq datasets linked in the paper's appendix; the data are structured and accessible, though versioning and full metadata are not formalized per FAIR standards.

Metrics (5.00): Includes precise and interpretable metrics (annotation accuracy and QA accuracy) that align directly with the task outputs and benchmarking goals.

Reference Solution (0.00): Model evaluations and LLM agent results are discussed, but no fully packaged, runnable baseline has been confirmed yet.

Documentation (5.00): The dataset and paper are accessible, and Jupyter notebooks (.ipynb) for setup are available in the GitHub repository.

Average rating: 4.00/5
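The average is consistent with an unweighted mean of the six category ratings. The short check below assumes that is how the figure is computed; the entry itself does not state the aggregation rule, and the variable names are illustrative.

```python
# Category ratings as listed in this entry (scale of 0 to 5).
ratings = {
    "Software": 5.0,
    "Specification": 4.0,
    "Dataset": 5.0,
    "Metrics": 5.0,
    "Reference Solution": 0.0,
    "Documentation": 5.0,
}

# Assumption: the reported average is a plain, unweighted mean.
average = sum(ratings.values()) / len(ratings)

print(f"Average rating: {average:.2f}/5")
```

The ratings sum to 24 across six categories, which gives the reported 4.00.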

Radar plot

[Figure: radar plot of the ratings for BaisBench (Biological AI Scientist Benchmark) - Question Answering]
