CURIE (Scientific Long-Context Understanding, Reasoning and Information Extraction)

Date: 2024-04-02

Name: CURIE Scientific Long-Context Understanding, Reasoning and Information Extraction

Domain: Materials Science, High Energy Physics, Biology & Medicine, Chemistry, Climate & Earth Science

Focus: Long-context scientific reasoning

Task Types: Information extraction, Reasoning, Concept tracking, Aggregation, Algebraic manipulation, Multimodal comprehension

Metrics: Accuracy

Models: unknown

AI/ML Motif: Reasoning & Generalization

Resources

Benchmark: Visit

Keywords

Citation

  • Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Ekin Dogus Cubuk, Muratahan Aykol, Amil Merchant, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, and Subhashini Venugopalan. CURIE: evaluating LLMs on multitask scientific long context understanding and reasoning. 2025. URL: https://arxiv.org/abs/2503.13517, arXiv:2503.13517.
@misc{cui2025curieevaluatingllmsmultitask,
  title={CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning}, 
  author={Hao Cui and Zahra Shamsi and Gowoon Cheon and Xuejian Ma and Shutong Li and Maria Tikhanovskaya and Peter Norgaard and Nayantara Mudur and Martyna Plomecka and Paul Raccuglia and Yasaman Bahri and Victor V. Albert and Pranesh Srinivasan and Haining Pan and Philippe Faist and Brian Rohr and Ekin Dogus Cubuk and Muratahan Aykol and Amil Merchant and Michael J. Statt and Dan Morris and Drew Purves and Elise Kleeman and Ruth Alcantara and Matthew Abraham and Muqthar Mohammad and Ean Phing VanLee and Chenfei Jiang and Elizabeth Dorfman and Eun-Ah Kim and Michael P Brenner and Viren Jain and Sameera Ponda and Subhashini Venugopalan},
  year={2025},
  eprint={2503.13517},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.13517}, 
}

Ratings

| Category | Rating | Notes |
| --- | --- | --- |
| Software | 4.00 | Code is available, but not well documented |
| Specification | 1.00 | Explains the types of problems in detail, but does not state exactly how to administer them |
| Dataset | 4.00 | Dataset is available via GitHub, but hard to find |
| Metrics | 5.00 | Quantitative metrics such as ROUGE-L and F1 are used; metrics are tailored to the specific problem |
| Reference Solution | 1.00 | Exists, but is not open |
| Documentation | 5.00 | Associated paper explains all criteria |
Average rating: 3.33/5

Radar plot

[Radar plot: CURIE (Scientific Long-Context Understanding, Reasoning and Information Extraction) ratings]