April 2, 2025

Present

Azza Ahmad, Datta Nimmaturi, Fred Sala, Gary Mazzaferro, Gregg Barrett, Geoffrey Fox, Gregor von Laszewski, Howard Pritchard, Javier Toledo, Jihao Shi, Juri Papay, Matt Sinclair, Murali Emani, Nhan Tran, Philip Harris, Satoshi Iwata, Shantenu Jha, Shirley Moore, Tom Gibbs, Victor Lu

Apologies

Christine Kirkpatrick

Tentative Agenda

Any New Members Introduction
Presentation by Frederic Sala on TPBench Public Problems and Model Solutions – TP Bench (https://pages.cs.wisc.edu/~fredsala/)
New meeting times. Discussion of Doodle polls. Preferred time for USA-Asia is 9:05 pm Eastern
White Papers
Please take a look at the status and locations in the minutes of the February 19 and March 19 meetings.
Continuing discussion of the catalog of Science benchmarks based on https://docs.google.com/spreadsheets/d/1Ysk32dqkgdGfDW0rFaCpc8o1Cp6uhtJqbDFAIlhfb9o/edit?usp=sharing
Note benchmark summaries from ChatGPT and Gemini Deep Research
Any Other Business

Google Meet Notes

Full notes are in the document "Copy of MLC Science WG - 2025/04/02 07:56 PDT - Notes by Gemini".
A summary of these notes:
Time zone confusion related to daylight saving time changes.
A humorous discussion about purchasing Jack Daniels whiskey due to a potential trade war.
Introductions of new members, Fred Sala and Datta Nimmaturi.
A presentation by Fred Sala on TPBench, a benchmark for AI performance in theoretical physics. This included discussions on reasoning models, data contamination, benchmark design, results, and future directions.
Questions and discussions related to TPBench, including data set creation, community contributions, training infrastructure, and scaling laws.
Meeting time and scheduling for future meetings.
Plans for future presentations, including one from the Aurora GPT team.
The main focus of the meeting was Fred Sala's presentation on TPBench and the related discussions about AI benchmarking in theoretical physics.

New Members

Datta Nimmaturi (LinkedIn: Machine Learning Engineer 3, Nutanix) is an AI/ML Engineer at Nutanix, currently working on research engineering for LLMs.
Frederic Sala is an assistant professor in the Computer Sciences department at the University of Wisconsin-Madison, where he studies the fundamentals of data-driven systems. He is also a research scientist at Snorkel, which is building a data-first approach to AI.

Presentation on TPBench

Fred Sala’s wonderful talk was recorded: https://drive.google.com/file/d/1fCeEj5bCIl6Hm86CSOO_9MzcT4EFeWGp/view?usp=sharing
The talk covered:
The current state of "reasoning" models: how they work, what they have been evaluated on, and their known weak spots.
How TPBench was built and what it newly reveals about reasoning models.
The implications, and where such reasoning models are expected to fit within AI for science in the near and medium term.
An arXiv paper covers this work: [2502.15815] "Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics".
The Google Meet notes linked above have an accurate summary of the talk.
We will try to continue this theme, with Franck Cappello arranging a presentation from the AuroraGPT evaluation team.

New Meeting Time

Following the Doodle poll, we proposed meeting every other Tuesday at 9 pm, starting April 22. We passed this proposal to David Kanter for approval.

Scientific benchmarks and challenges - follow up discussion

Need to follow up with MLCommons on how to fix the table. Verify with Harshat which of the columns we identified are not properly defined in the benchmark.

Catalog of Science benchmarks

MLCommons Science/HPC Benchmarks Overview

White Papers

Gregor has an opportunity to create a special collection with Frontiers in High Performance Computing. Should we proceed with this, and if so, what is the topic?
Gregor asked for clarification on what documents are needed.
He assumes a call that can be published is needed.
Maybe: High Performance Computing and Machine Learning Benchmarks …
Paper 1: The Benchmark carpentry white paper https://www.overleaf.com/9828764221czxzxxcxmcrr#1f1c84
Meeting with Armstrong: Focus on Energy
Meeting with Victor: More concrete information was requested. For example, he would like to focus on complexity theory. It was noted in discussion that most HPC benchmarks started from or include complexity theory, so it is important that he articulates not only what is done but also gives a concrete example, so that we understand how this differs from regular activities.
Matt Sinclair: Extend the classification. How about adding a third picture and seeing whether we should merge them or keep them separate?
Motivation for repeatable benchmarks: What we said has been validated by UFL with a small benchmark of traffic camera analysis, which has benchmark characteristics similar to the long-running earthquake prediction benchmark. The UFL and UVA A100 systems perform similarly.