December 11, 2024
Present
Geoffrey Fox, Gregor von Laszewski, Juri Papay, Wes Brewer, Armstrong Foundjem, Piotr Luszczek, Victor Lu, Shirley Moore, Azza Ahmad, Gavin Mitchell Farrell, Marisa Ahmad, Riccardo Balin, Vijay Janapa Reddi, Gregg Barrett, Karen Bennet, Preetham Reddy, Steven Farrell
Tentative Agenda
- Any New Members Introduction
- Cataloging Existing Science Benchmarks
- White Papers
- Benchmark Carpentry https://docs.google.com/document/d/15YIlAWOBA2_xjXkTnAZmaw003Jh4eqURVZYQHhdGYdQ/edit#heading=h.fa0u4qc1plw5 https://www.overleaf.com/project/67585323797c7e764c254a84
- MLCommons Science FAIR Concept Paper (AI Readiness) https://docs.google.com/document/d/1NbL-VdkrY9jzPxveOys2RCK8TdEJ7O5wgnxjAgzK-rE/edit?usp=sharing
- Choosing Up-to-date Benchmarks (not discussed directly)
New Members
- Karen Bennet https://www.linkedin.com/in/bennetkl/ is a dynamic, experienced senior leader who helps enterprises deploy secure, scalable, transparent, distributed software using next-generation technologies (cloud, advanced data analytics, cybersecurity, blockchain, AI, and machine learning). She also provides training and mentoring on navigating technology hype and on Women in Technology. She is involved in IEEE, ISO, and Linux Foundation ML/AI working groups.
Benchmark Carpentry
- Gregor described the move of the Benchmark Carpentry paper from Google Docs to Overleaf https://www.overleaf.com/project/67585323797c7e764c254a84. The paper describes the basics of benchmarking and aims to make MLCommons benchmarks accessible to a broad range of users. This aligns with the Science working group's observation about the educational value of (MLCommons) benchmarks.
- Gregor is working on two figures.
- In particular, some difficulties in navigating the MLCommons web site were described, though several people did not experience them. It was also not easy to find coordinated information about all benchmarks and all papers on those benchmarks. Marisa and Juri will look into this.
- Marisa will also see whether a common list of benchmarks can be assembled; we discussed what information is needed for each benchmark.
- Gregor mentioned a journal special issue, or summarizing benchmarks in a few pages on the MLCommons website.
- Gregor wanted a single resource/link for MLCommons benchmark citations
- See Victor Lu’s Comments subsection below for further remarks.
- Victor and Gregor discussed Python and storage problems.
- Gregor noted that standalone machines such as his PC and his DGX A100 workstation outperform HPC clusters due to better storage coupling (see the storage-throughput sketch after this list).
- Vijay asked about the goal of the carpentry paper; Gregor noted it describes the best way to build benchmarks.
- Gregor noted that students found it hard to run some benchmarks; the paper also tries to address this.
- In this regard, the paper was renamed from Benchmark Carpentry to Democratizing MLCommons Benchmarks (so that all can run them).
- This reflects a pedagogical interest in smaller machines rather than the very largest systems, although the latter are essential to MLCommons, since it is such machines that are consuming so much of the world’s electricity and delivering our foundation models.
- In refining this point, Juri reminded us of the difficulties with using PyTorch on Frontier
- Gregor noted Shirley’s reproducibility comment: we can’t reproduce MLCommons results because our machines are shared, so we see lower performance than in the MLCommons benchmark tables.
- Shirley noted the value of single-GPU versions for studying performance issues without having to run the whole giant job; Juri agreed (see the single-GPU timing sketch after this list).
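As a concrete illustration of the storage-coupling point above, here is a minimal sketch of a sequential-read throughput microbenchmark. The file path, file size, and block size are assumptions for illustration, not part of any MLCommons benchmark; note that the OS page cache can inflate results on repeated runs, so a real comparison would use files larger than RAM.

```python
import os
import time

# Hypothetical test file; point this at the storage system under test.
PATH = "testfile.bin"
BLOCK = 4 * 1024 * 1024  # 4 MiB read size (an assumption; tune per system)

# Create a ~1 GiB test file if it does not already exist.
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        for _ in range(256):
            f.write(os.urandom(BLOCK))

# Time a full sequential pass over the file.
size = os.path.getsize(PATH)
start = time.perf_counter()
with open(PATH, "rb") as f:
    while f.read(BLOCK):
        pass
elapsed = time.perf_counter() - start

print(f"Read {size / 2**30:.2f} GiB in {elapsed:.2f} s "
      f"({size / 2**20 / elapsed:.1f} MiB/s)")
```

Running this on a workstation with local NVMe storage and on an HPC node backed by a shared parallel file system is one simple way to quantify the storage-coupling difference Gregor described.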
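In the same spirit, below is a minimal sketch of timing training steps on a single GPU with PyTorch, as Shirley suggested. The model and tensor shapes are synthetic placeholders, not any specific MLCommons workload; a real study would substitute the single-GPU variant of the benchmark.

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model and synthetic batch for illustration only.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                      nn.Linear(4096, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(256, 1024)
y = torch.randint(0, 10, (256,))

# Warm up once so one-time CUDA initialization is not counted.
loss_fn(model(x.to(device)), y.to(device)).backward()
if device == "cuda":
    torch.cuda.synchronize()

# Time a handful of steps; synchronize so GPU work is fully counted.
steps = 20
start = time.perf_counter()
for _ in range(steps):
    opt.zero_grad()
    loss = loss_fn(model(x.to(device)), y.to(device))
    loss.backward()
    opt.step()
if device == "cuda":
    torch.cuda.synchronize()

print(f"{(time.perf_counter() - start) / steps * 1000:.2f} ms per step on {device}")
```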
Cataloging Existing Science Benchmarks
- Geoffrey noted many sources already available for Science benchmarks
- MLCommons Science Working Group
- MLCommons HPC Working Group
- FastML Inference benchmarks for HEP covering different interesting regions in latency-datarate space: 230322 mlcommons science.pdf https://arxiv.org/abs/2207.07958 https://github.com/fastmachinelearning/fastml-science
- NSF HDR ML Challenge https://www.nsfhdr.org/mlchallenge
- FAIR SBI Initiative including some of CaloChallenge packaged for easy use
- CaloChallenge 2022: A Community Challenge for Fast Calorimeter Simulation http://arxiv.org/abs/2410.21611 https://arxiv.org/abs/2406.12898
- Google WeatherBench2 https://sites.research.google/weatherbench/ https://arxiv.org/abs/2308.15560
- SciML-Bench from RAL, UK https://github.com/stfc-sciml/sciml-bench
- There are also Big Data benchmarks such as BigDataBench https://www.benchcouncil.org/BigDataBench/
- The Science working group will not follow this up further.
Any Other Business
- Geoffrey apologized for starting the meeting 15 minutes late. A few members gave up waiting; I am very sorry.
- It was noted that David Kanter was very busy and did not often come to our meetings. However, we have seen very good support from his representative Marisa.
- It was noted that the HPC working group status was still on hold.
Lessons from SC24 Meeting (not discussed directly)
- Vijay suggested a separate paper on the “Big Picture for MLCommons”
- Need to give a larger scope for scientific benchmarks
- Generate a TPC (Trillion Parameter Consortium) checklist for a global project
- Juri agreed
Victor Lu’s Comments
1.) Bill of Materials (BOM): I believe a BOM could be used in the context of reproducible science.
Referring to my AIBOM for reproducible science proposal: https://docs.google.com/document/d/1HdS_GxQvPA7y1ilspGmex-HN_cnjriR1nrpr_fnYzXY/edit?tab=t.0
I believe that most reproducible science projects store ontology data about research projects in graph databases. It is essential to establish a streamlined process to align the information captured in graph databases with metadata co-located with software/code, AI models, datasets, and related assets. A Bill of Materials (BOM) could serve as an ideal mechanism to facilitate this alignment, providing a robust solution for reproducible science projects.
The System Package Data Exchange (SPDX®) specification defines an open standard for communicating bill of materials (BOM) information for different topic areas.
The SPDX RDF ontology, expressed in RDF/OWL/SHACL format, is published online as the SPDX 3.0.1 Model:
https://spdx.github.io/spdx-spec/v3.0.1/annexes/rdf-model/
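To make the alignment idea concrete, here is a minimal sketch using rdflib to link a dataset, a model, and code in a small RDF graph that could be co-located with the assets and kept in sync with a project's graph database. The namespace and property names (ex:trainedOn, ex:builtBy) are illustrative placeholders, not the actual SPDX 3.0.1 vocabulary.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Illustrative namespace; a real BOM would use the SPDX 3.0.1 vocabulary
# from https://spdx.github.io/spdx-spec/v3.0.1/annexes/rdf-model/ instead.
EX = Namespace("https://example.org/ai-bom#")

g = Graph()
g.bind("ex", EX)

dataset = EX["climate-dataset-v2"]
model = EX["surrogate-model-v1"]
code = EX["training-code-v1"]

# Record the assets and how they relate, mirroring what a graph
# database of research-project ontology data might hold.
g.add((dataset, RDF.type, EX.Dataset))
g.add((model, RDF.type, EX.AIModel))
g.add((code, RDF.type, EX.Software))
g.add((model, EX.trainedOn, dataset))
g.add((model, EX.builtBy, code))
g.add((model, EX.version, Literal("1.0")))

# Serialize as Turtle; this metadata file travels with the assets.
print(g.serialize(format="turtle"))
```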
2.) Benchmark Carpentry paper: I believe understanding the "internals" of how hardware and software function in relation to scalability/response-time bottlenecks may be crucial before designing benchmarks to measure related metrics effectively. For instance, regardless of how much CPU, GPU, HSM... memory is available to a PyTorch-based AI workload, Python-specific limitations may still need to be addressed to achieve further performance improvements. Another example involves trade-offs in storage performance, where it is well known that better performance can often be "achieved" by compromising on memory consistency and data integrity.
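As a hedged illustration of one such Python-specific limitation (the global interpreter lock, assuming CPython), the sketch below compares a CPU-bound task run with threads versus processes; the workload is a synthetic placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    """Synthetic CPU-bound work: sum of squares up to n."""
    return sum(i * i for i in range(n))

def timed(executor_cls, workers: int = 4, n: int = 2_000_000) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        list(ex.map(cpu_bound, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Threads share one interpreter: the GIL serializes CPU-bound work.
    print(f"threads:   {timed(ThreadPoolExecutor):.2f} s")
    # Processes each get their own interpreter and can run in parallel.
    print(f"processes: {timed(ProcessPoolExecutor):.2f} s")
```

On a typical multi-core machine the process pool finishes several times faster, which is one reason PyTorch's DataLoader uses worker processes rather than threads for data preparation.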