November 27, 2024

Present

Geoffrey Fox, Gregor von Laszewski, Juri Papay, Gary Mazzaferro, Wes Brewer, Armstrong Foundjem, Piotr Luszczek, Victor Lu, Shirley Moore, Claus Weiland, Azza Ahmad, Murali Emani, Christine Kirkpatrick, Gavin Mitchell Farrell, Howard Pritchard, Noah Nisbet, Indra Priyadarsini

Tentative Agenda

New Members

Any Other Business

MLCommons Science FAIR Concept Paper (AI Readiness)

  • Christine Kirkpatrick noted her NSF RCN https://www.farr-rcn.org/
  • Christine noted that the original paper had comments from Tom Gibbs
  • There was a discussion of the quality and use of MLPerf HPC and Training benchmarks
  • Christine noted that an attendee at the SC24 MLPerf BOF expressed regret about the removal of ResNet from MLPerf
  • ResNet is a good benchmark to study, as many results have been recorded for it
  • Shirley noted that the results are not reproducible; she got results that were a factor of 2 slower than the listed values for DeepCam
  • Juri said this was typical and would share his measurements
  • Christine said Juri wanted results to be reproducible at the 10% level
  • Gregor noted his PC was twice as fast as a server A100
  • Performance depends on whether Exclusive mode is used, which is not part of the MLCommons metadata (see the sketch after this list)
  • Performance depends on the queuing policy, which changes over time
  • Data staging costs are uncertain, and one needs to set up the data properly before running a benchmark
  • Gregor noted that UVA charges for storage and the GPUs are typically set up to run well from shared storage
  • Geoffrey/Gregor’s DGX machine at UVA is 4x faster than most at UVA due to a custom data setup
  • We discussed the use of ML to improve ML
  • Juri noted that RAL tried to collect log files and use them for optimization, but the project did not complete because the log files could not be obtained
  • Christine noted that there were too many free-text fields in the MLPerf logs
  • Gregor noted Sabath from Piotr (UTK), https://sbi-fair.github.io/docs/software/sabath/, enhanced a little at UVA
  • Extend the MLPerf metadata model as was done in Sabath
  • Juri wants to characterize benchmarks so that one can reason about them
  • Victor brought up the general issue of reproducibility, which is discussed later as a separate topic
  • Christine noted AI reproducibility efforts at NeurIPS and AAAI
  • She will reframe the paper based on these discussions; her new outline is given below
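
Several of the gaps noted above (Exclusive mode, queuing policy, data staging) are environment facts that could be captured automatically alongside each run. Below is a minimal sketch in Python, assuming nvidia-smi is on the PATH; the record layout and field names are illustrative only, not an MLCommons schema.

    import json
    import platform
    import subprocess
    from datetime import datetime, timezone

    def gpu_compute_modes():
        """Query the compute mode (e.g. Exclusive_Process) of each visible GPU."""
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,compute_mode", "--format=csv,noheader"],
                capture_output=True, text=True, check=True,
            )
            return [line.strip() for line in out.stdout.splitlines() if line.strip()]
        except (OSError, subprocess.CalledProcessError):
            return ["unknown"]  # no NVIDIA driver/GPU available

    def run_metadata(staged_data_path, queue_name=None):
        """Collect environment facts that affect benchmark performance.

        Field names are illustrative, not part of any MLCommons schema.
        """
        return {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "hostname": platform.node(),
            "gpus": gpu_compute_modes(),               # Exclusive mode or not
            "queue": queue_name,                        # scheduler queue/policy in effect
            "staged_data_path": str(staged_data_path),  # where the data was staged
        }

    if __name__ == "__main__":
        print(json.dumps(run_metadata("/scratch/deepcam", queue_name="gpu-exclusive"), indent=2))

Recording such a record next to every result would, for example, let one separate Exclusive-mode runs from shared-node runs when comparing numbers.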

AI Readiness Paper: New Outline from Christine Kirkpatrick

  1. Introduction
  2. Motivation
    1. Reproducibility of benchmark results - benchmarks aren’t capturing the relevant metadata needed
    2. The current state of practice is that it is very difficult to reproduce benchmarks and the numbers can’t be trusted.
    3. Desire to use ML to improve ML, difficult to operationalize without good data (logs)
  3. Background
  4. MLCommons Science WG & HPC
  5. AI reproducibility
    1. what definition we’re working with, goal
    2. Current practices/guidance for documenting
  6. Limitations of benchmarks & related system understanding
  7. Missing competencies of today’s computational and data scientists to understand performance impacts of running benchmarks (ex: where /tmp is)
  8. No discussion of exclusive mode or other issues that greatly impact performance.
  9. Changes that would improve MLCommons benchmark analysis / FAIRifying MLCommons benchmarks
  10. In-depth analysis of the DeepCam benchmark
  11. Metadata model for benchmark logs (aspirational goal)
  12. Future Work
  13. Conclusion

Benchmark Carpentry Paper

  • Gary noted that application benchmarks consist of five (5) critical elements (see the sketch after this list):
    • algorithm performance
    • infrastructure performance
    • workflow (training/production)
    • infrastructure configuration
    • data size and composition
  • Gregor noted that we should refer to FAIR material in the Carpentry paper
  • The paper is currently 2 pages
  • The goal is to democratize ML through benchmarks
  • MLPerf targets those who run systems
  • Users aren’t familiar with benchmarking
  • Geoffrey suggested developing taxonomies of application requirements in various categories, such as:
    • MNIST “Hello World”
    • LLM
    • Time series
    • Image-based science applications
    • Graph-based, as in molecules and social networks
  • We discussed that GitHub repositories suffer from incomplete metadata
  • Maybe science applications are more diverse than commercial ones and so need more benchmarks
  • Christine had a table of benchmarks in the original version of the paper; she will find it
  • Gary M noted that the size of datasets can be important due to hardware infrastructure and backend setup
    • e.g., a single GPU is very different from multiple GPUs
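
Gary’s five elements and Geoffrey’s taxonomy suggest a small structured record rather than free text. A minimal sketch in Python; the class, field, and category names are invented for illustration and are not an existing MLCommons schema.

    from dataclasses import dataclass
    from enum import Enum

    class AppCategory(Enum):
        """Geoffrey's suggested application categories (names are illustrative)."""
        MNIST_HELLO_WORLD = "mnist-hello-world"
        LLM = "llm"
        TIME_SERIES = "time-series"
        IMAGE_SCIENCE = "image-based-science"
        GRAPH = "graph-based"  # e.g. molecules, social networks

    @dataclass
    class BenchmarkRecord:
        """Gary's five critical elements of an application benchmark."""
        algorithm_performance: dict          # e.g. accuracy, loss, epochs to converge
        infrastructure_performance: dict     # e.g. throughput, time to train
        workflow: str                        # "training" or "production"
        infrastructure_configuration: dict   # e.g. GPU count, interconnect, storage
        data: dict                           # size and composition of the dataset
        category: AppCategory

    # A single-GPU MNIST run and a multi-GPU LLM run are then described on the
    # same five axes and can be compared (or flagged as incomparable) directly.
    record = BenchmarkRecord(
        algorithm_performance={"accuracy": 0.99},
        infrastructure_performance={"time_to_train_s": 120.0},
        workflow="training",
        infrastructure_configuration={"gpus": 1},
        data={"samples": 60_000, "size_gb": 0.05},
        category=AppCategory.MNIST_HELLO_WORLD,
    )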

Victor Lu’s Reproducibility Comments

  • My comment on the topic Christine brought up (“What data practices do you wish the ML/HPC/CS world would adopt?”): to establish a “data practice,” it is essential to capture metadata that describes the AI-driven scientific research process. AIBOM (AI Bill of Materials) is one type of metadata document. I created the following document as part of the CISA AIBOM tiger team effort to capture AIBOMs properly for research reproducibility: https://docs.google.com/document/d/1HdS_GxQvPA7y1ilspGmex-HN_cnjriR1nrpr_fnYzXY/edit?tab=t.0
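
For readers unfamiliar with the idea, an AIBOM-style record lists the artifacts a result depends on. The sketch below is hypothetical; the field names are invented for illustration and are not the CISA tiger team’s actual format (see the linked document for that).

    # A hypothetical, minimal AIBOM-style record; all field names are illustrative.
    aibom = {
        "model": {"name": "example-resnet50", "version": "1.0", "license": "Apache-2.0"},
        "training_data": [{"name": "example-dataset", "version": "2024.1",
                           "source": "https://example.org/data"}],
        "software": [{"name": "pytorch", "version": "2.3.0"}],
        "hardware": {"gpus": "8x A100", "interconnect": "NVLink"},
        "provenance": {"produced_by": "training-run-42", "log_files": ["run42.log"]},
    }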

  • More: If you look at my document, the main reference website is: http://www.practicereproducibleresearch.org/

  • I believe there are some interesting points made by this research website.

  • Goal/scope: clarify the goals and scope of the MLCommons Science Working Group - why we work on the AI Readiness white paper and the Benchmark Carpentry white paper. Can we achieve the following goals?

  • http://www.practicereproducibleresearch.org/core-chapters/5-lessons.html
  • Computational reproducibility and transparency, which emphasizes code documentation.
  • Scientific reproducibility and transparency, which emphasizes documentation of scientific decisions and accessibility of data.
  • Computational correctness and evidence, which emphasizes automated testing and validation.
  • Statistical reproducibility, which emphasizes transparency of data analysis and the logical path to scientific conclusions.
  • Gregor noted that he would add “numerical reproducibility vs. computational reproducibility” and that the sensitivity of AI to numerical precision (byte size) is important (a minimal demonstration follows this list)
  • Wes noted an interesting reference on numerical reproducibility: https://arxiv.org/pdf/2408.05148
  • Template: While it may fall outside the scope of MLCommons, defining a "Basic Reproducible Workflow Template" could significantly enhance reproducibility and transparency. The current template on the referenced website is outdated and needs revision, which highlights the importance of establishing guiding "principles" rather than focusing solely on technical specifics; principles remain relevant despite technological advancements. http://www.practicereproducibleresearch.org/core-chapters/3-basic.html
  • Next Step: To drive fundamental change in how science is conducted, including benchmarking practices, it is essential to establish the right incentives—both financial and cultural. http://www.practicereproducibleresearch.org/core-chapters/6-future.html
  • "our view is that some of the most compelling opportunities are in how incentives - and the practice of science more generally - can be changed by groups such as funding agencies, journal editors and libraries."

  • Finally, "AI readiness" is a broad term that can vary significantly depending on the context. In industries where profit often takes precedence, its meaning may differ from that within the scientific community, where the focus is more on long-term benefits for humanity. This, in turn, will change the scope of our work in this science group.