April 16, 2025

Present

Ali Hashmi, Armstrong Foundjem, Azza Ahmad, Carl Ehrett, Christine Kirkpatrick, Elizabeth Campolongo, Gary Mazzaferro, Gregg Barrett, Geoffrey Fox, Gregor von Laszewski, Hussain Ather, Iulia Ibanescu, Javier Toledo, Juri Papay, Krishna Gopal, Matt Sinclair, Murali Emani, Nhan Tran, Philip Harris, Piotr Luszczek, Rini Susan, Satoshi Iwata, Shirley Moore, Stefan Dvoretskii, Tues Day, Victor Lu

Tentative Agenda

Any New Members Introduction
Additional meeting times: April 22 at 9 pm and April 23 at 11 pm Eastern
Next Steps in Science LLM Evaluation following TPBench
White Papers
Continuing discussion of New Benchmarks and the catalog of Science benchmarks based on https://docs.google.com/spreadsheets/d/1Ysk32dqkgdGfDW0rFaCpc8o1Cp6uhtJqbDFAIlhfb9o/edit?usp=sharing
Note benchmark summaries from ChatGPT and Gemini Deep Research
Any Other Business

Google Meet Notes

Full notes are at "Copy of MLC Science WG - 2025/04/16 07:58 PDT - Notes by Gemini"
Summary of the notes is:
The meeting addressed industry concerns about the discontinuation of the MLCommons HPC working group. Murali Emani and Geoffrey Fox clarified that it was discontinued for lack of benchmark submissions and that the existing benchmarks remain available.
New members (Carl Ehrett, Tues Day, Rini Susan, Iulia Ibanescu, Krishna Gopal) introduced themselves, and the group discussed benchmark categorization.
Gary discussed funding for a compliance project (seeking $500,000-$750,000).
Nhan and Matt discussed defining the benchmarking end goal (a categorized set, not a new submission round) and leveraging existing resources like openml.org.
Nhan Tran will organize an informal meeting to further categorize existing benchmarks.
Gregor von Laszewski requested contributions to the benchmark carpentry paper.

New Members

Carl Ehrett is Director of Applied Machine Learning in Research Computing & Data at Clemson University (LinkedIn: "Carl Ehrett - Director of Applied Machine Learning - Clemson University"). Looking at AI on Palmetto (see "About the Palmetto 2 Cluster | RCD Documentation").
Tues Day is Director of Research, audio engineer, ML engineer, and safety researcher. Previously introduced under New Members on July 10, 2024. Interested in security and quantum. She founded Artifex Labs in Portland. https://www.linkedin.com/in/222tuesday/
Rini Susan V S (https://www.linkedin.com/in/rinisusan/; also on Medium as "Rini Susan V S") works for Red Oak Technologies on AI performance for applications. Works with Apple.
Iulia Ibanescu (https://www.linkedin.com/in/iulia-ibanescu/) has been a Senior HPC Engineer at Boston Limited (see "Boston Limited | The Org") since May 2021, with a strong background in high-performance computing and atmospheric science. Works across the world, including South Africa. Interacts with the MLCommons Storage WG.
Krishna Gopal (https://www.linkedin.com/in/kgopal/) is a Staff Engineer at Celestica in Chennai. Has explored MLCommons benchmarks.

General discussion

Matt pointed out that industry was still interested in HPC Benchmarks. Murali will follow up.
Catalog of Science benchmarks
MLCommons Science/HPC Benchmarks Overview
We need to make a good web site from this
Following up on TPBench
[2311.12022] GPQA: A Graduate-Level Google-Proof Q&A Benchmark
[2501.14249] Humanity's Last Exam
Nhan tried but could not establish any contact with “Papers with Code”
The European organization OpenML seems promising, and Nhan contacted them after the meeting
About OpenML
OpenML Task collections
Contacts for members
Armstrong foundjem@ieee.org
Tues Day has a cool email Tuesday@artifex.fun
Nhan Tran ntran@fnal.gov

Status of White Papers

See papers 1, 2, 3 https://docs.google.com/document/d/167m7FK6-Ud4M5gXta5cIc1hKqaRHkk2B1GyKasdeQLc/edit?pli=1&tab=t.0#heading=h.b1jox6cj5tjq
Paper 1: The Benchmark carpentry white paper https://www.overleaf.com/9828764221czxzxxcxmcrr#1f1c84 is fully active, and participation is encouraged. If interested, contact laszewski@gmail.com
Gregg thought Gregor was being very flexible and accommodating.
Challenges in understanding and reproducing MLCommons benchmarks were noted
Also students are not properly trained to prepare benchmarks.
Benchmarking as a science is lost
Paper 2: "Using Benchmarking Data to Inform Decisions Related to Machine Learning Resource Efficiency", Kirkpatrick, Christine; Barrett, Gregg; Brewer, Wesley; Christopher, Julianne; Dutra, Inês; Emani, Murali; Luszczek, Piotr; Shankar, Mallikarjun; von Laszewski, Gregor; Papay, Juri; Fox, Geoffrey. https://doi.org/10.5281/zenodo.15022149
Jeyan and Gregor are working on an improved version
https://docs.google.com/document/d/1aPRYM7_jdwWgmd4_Fsjtf3oH0ZcAAljQ/edit#heading=h.vmlqcpehwg0k
This version has been converted to Overleaf: https://www.overleaf.com/read/gbvhrjmqmskm#bec8e2
Christine noted the PNAS article "Is stochastic thermodynamics the key to understanding the energy costs of computation?"
Paper 3 was aimed at a special publication opportunity but was rejected. It is probably frozen for good. We need more information from Christine about this one.
https://docs.google.com/document/d/1NbL-VdkrY9jzPxveOys2RCK8TdEJ7O5wgnxjAgzK-rE/edit?tab=t.0

Science Working Group Projects

Science Benchmarks
HPC Benchmarks
New White Papers
Interesting Presentations at meetings
Produce Taxonomy of Existing Science AI Codes/Benchmarks
Previous papers, tutorials, and Birds of a Feather sessions at Conferences
Teach an LLM to manage lists of benchmarks and codes