April 30, 2025
Present
Aaron Goulden, Armstrong Foundjem, Azza Ahmad, Carl Ehrett, Christine Kirkpatrick, Geoffrey Fox, Gregor von Laszewski, Howard Pritchard, Iulia Ibanescu, Javier Toledo, Matt Sinclair, Murali Emani, Nhan Tran, Philip Harris, Piotr Luszczek, Satoshi Iwata, Shirley Moore, Victor Lu, Wenhui Zhang
Tentative Agenda
- Introduction of any new members
- Additional meeting times: April 22 at 9 pm and April 23 at 11 pm Eastern
- Next Steps in Science LLM Evaluation following TPBench
- White Papers
- Continuing discussion of New Benchmarks and the catalog of Science benchmarks based on https://docs.google.com/spreadsheets/d/1Ysk32dqkgdGfDW0rFaCpc8o1Cp6uhtJqbDFAIlhfb9o/edit?usp=sharing
New Member
- Aaron Goulden is Founder and Technical Director of the AI startup Delightful Data AI Inc. He attended the University of Central Lancashire and is based in the Manchester area, United Kingdom. See https://www.linkedin.com/in/aarong11/
Google Meet Notes
- The Google Meet notes are useless, as the report says “A summary wasn't produced for this meeting because there wasn't enough conversation in a supported language.” A rather implausible excuse.
White Papers
- Gregor arranged to meet with Nhan, Armstrong, and Victor
- Armstrong will present at our meeting two weeks from now, on May 14, 2025
- Matt noted that he can work with Nhan on adding Gregor's suggestion to the MLCarpentry Overleaf (“let me know if you need me”)
Discussion of Nhan’s Benchmark Analysis
- Nhan presented from the document “MLCommons Benchmarks - Taxonomy - 4/30/25”
- Victor noted, citing the article “Yann LeCun Calls LLMs 'Token Generators' While Llama Hits a Billion Downloads,” that “modern AI systems fail in four key areas: they lack awareness of the physical world, have limited memory and no continuous recall, are incapable of reasoning, and struggle with complex planning.”
- Matt suggested that, for systems researchers, we should develop a scheme to take the hundreds of ML+Science workloads and characterize them (e.g., by compute vs. memory intensity) so that a representative subset can be identified. Systems researchers wanting to run MLCommons Science workloads could then run only that subset and still be confident its behavior represents the larger set. This would make the workloads far more practical to adopt, since running hundreds of them is unlikely to be feasible. A minimal sketch of the idea appears below.
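
One possible way to realize Matt's suggestion (a sketch only, not something the group decided on) is to treat each workload as a point in a feature space of system-level metrics and cluster those points, taking the workload nearest each cluster center as the representative. In the sketch below, the workload names, the two features (compute intensity in FLOPs/byte and peak memory in GB), the feature values, and the choice of k-means with k=2 are all illustrative assumptions, not measured data or an agreed methodology.

```python
# Sketch: characterize workloads by hypothetical system-level features and
# pick one representative per cluster. All names and numbers are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table: workload -> (compute intensity FLOPs/byte,
# peak memory GB). Real values would come from profiling each workload.
workloads = {
    "cloudmask":  (12.0, 40.0),
    "stemdl":     (85.0, 16.0),
    "earthquake": ( 9.0, 64.0),
    "candle-uno": (70.0, 24.0),
    "cosmoflow":  (15.0, 96.0),
}

names = list(workloads)
# Standardize features so compute and memory contribute on equal scales.
X = StandardScaler().fit_transform(np.array([workloads[n] for n in names]))

# Group the workloads into k behavioral classes (k chosen arbitrarily here;
# in practice it could be set via an elbow or silhouette analysis).
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# The representative of each cluster is the member closest to its centroid.
for c in range(k):
    members = [i for i, lbl in enumerate(km.labels_) if lbl == c]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    rep = names[members[int(np.argmin(dists))]]
    print(f"cluster {c}: representative={rep}, "
          f"members={[names[i] for i in members]}")
```

With more features (I/O intensity, communication volume, GPU vs. CPU share, etc.) the same selection step applies unchanged; only the feature table grows wider.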