July 24, 2024
Present
Geoffrey Fox, Riccardo Balin, Shantenu Jha, Rashadul Kabir, Jeyan Thiyagalingam, Christine Kirkpatrick, Victor Lu, Wes Brewer, Armstrong Foundjem, Gregor von Laszewski, Piotr Luszczek, Juri Papay, Sujata Goswami, Tom Gibbs, Gregg Barrett, Luc Yu, Ali Hashmi
Tentative Agenda
- Special Seminar: Speaker: Riccardo Balin (slides: MLCommons_SimAIBench_240723.pdf)
- Title: SimAI-Bench: A Performance Benchmarking Tool for Coupled Simulation and AI Workflows
There was no time for any of the following
- Any New Members Introduction
- AI Alliance
- Linux Foundation AI & Data GenAI Commons: "Addressing Challenges in Open AI with LF AI & Data: Introducing the Model Openness Framework and Tool"
- Datasets linkage
- Any comments on the HPC working group?
- White Papers
- Using Benchmarking Data to Inform Decisions Related to Machine Learning Resource Efficiency https://docs.google.com/document/d/1gOKA8BnlJnsTAELWFSmL7Fl7kJej_UrNH-FVXbZFxGI/edit?usp=sharing Submitted (Christine Kirkpatrick)
- Benchmark Carpentry https://docs.google.com/document/d/15YIlAWOBA2_xjXkTnAZmaw003Jh4eqURVZYQHhdGYdQ/edit#heading=h.fa0u4qc1plw5
- AI Readiness of MLCommons Science https://docs.google.com/document/d/1NbL-VdkrY9jzPxveOys2RCK8TdEJ7O5wgnxjAgzK-rE/edit?usp=sharing
- Status of Benchmarks (OSMIBench)
- Science Foundation Models
- NASA Workshop https://sites.astro.caltech.edu/AIFM/speakers.html with links to talks discussing "IEEE SMC-IT/SCC 2024: Trustworthiness of Foundation Models and What They Generate".
- Any Other Business
Special Seminar: Wednesday, July 24, 11:05 pm Eastern
- Speaker: Riccardo Balin (slides: MLCommons_SimAIBench_240723.pdf)
- Title: SimAI-Bench: A Performance Benchmarking Tool for Coupled Simulation and AI Workflows
- Abstract: In situ AI/ML workflows, in which ML tasks are coupled to an ongoing simulation, are an attractive new paradigm for developing robust and predictive surrogate models for accelerating time to science by steering simulation ensembles and replacing expensive computations. In the world of high performance computing (HPC), these workflows require scalable and efficient solutions to integrate the rapidly evolving ecosystem of ML frameworks with traditional simulation codes by transferring large volumes of data between the various components. To address these issues, several libraries have recently emerged from groups in industry, academia, and national labs. In this talk, we introduce SimAI-Bench – a new tool for benchmarking and comparing the performance of different coupled simulation and AI/ML workflows on current and future HPC systems. In particular, the talk will focus on workflows for in situ training of graph neural network (GNN) surrogate models from ongoing computational fluid dynamic (CFD) simulations, requiring the transfer of training data between the two components. We will discuss how different open-source libraries enable such workflows and compare their data transfer performance and scaling efficiency on the Aurora supercomputer at the Argonne Leadership Computing Facility.
Notes on Presentation
- Wes Brewer (Senior Research Scientist, HPC and AI, Oak Ridge National Laboratory) introduced the speaker, whose group at Argonne is similar to Wes's group at Oak Ridge.
- Riccardo Balin https://www.linkedin.com/in/riccardo-balin-8a62a369/ is an Assistant Computational Scientist on the Data Services and Workflows team at the Argonne Leadership Computing Facility. He obtained B.S./M.S. degrees in Aerospace Engineering from the University of Colorado Boulder in 2016 and a Ph.D. in Computational Fluid Dynamics and Turbulence Modeling there in 2020. He joined Argonne National Laboratory in 2021.
- Riccardo motivated the importance of workflows combining simulations and machine learning. He gave two Argonne computational fluid dynamics examples: PHASTA-CEED + CFDML (GitHub: PHASTA/phasta, Parallel Hierarchic Adaptive Stabilized Transient Analysis of compressible and incompressible Navier-Stokes equations) and NekRS-ML + GNN (a NekRS fork combining AI/ML with NekRS, led by Argonne National Laboratory and other collaborators).
- They are initially using Polaris at ALCF but will move to Aurora
- NekRS is their first benchmark used in SimAI-Bench: Benchmarking Coupled Workflows (GitHub: argonne-lcf/SimAI-Bench, ALCF benchmarks for coupled simulation and AI workflows). SimAI-Bench is needed because users and computer support units have little understanding of the issues involved in combining simulation workflows with AI. This benchmark will help design the next ALCF machine.
- There was a good discussion of how the workload is decomposed, with the simulation running on CPUs and the deep learning network on GPUs, using classic CFD domain decomposition. The approach corresponds to "Motif #6: Adaptive Training" in the classification of the paper by Wes Brewer, Shantenu Jha, et al., "AI-coupled HPC Workflow Applications, Middleware and Performance".
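The "adaptive training" pattern above can be sketched in a few lines: a toy simulation advances on decomposed domain patches, and a surrogate is trained in situ on each freshly generated batch. This is an illustrative stand-in, not SimAI-Bench code; the linear model, `solve_patch`, and `Surrogate` names are assumptions, with a linear fit replacing the GNN.

```python
# Minimal sketch of an in-situ adaptive-training loop (Motif #6 style).
# A toy "simulation" produces (state, target) data per domain patch each
# time step, and a tiny surrogate model trains online on that data.
import numpy as np

rng = np.random.default_rng(0)
W_TRUE = np.array([1.0, -2.0, 0.5])   # hidden "physics" the surrogate learns

def solve_patch(step, patch_id, n=64):
    """Stand-in for one CFD time step on one domain patch: returns
    (features, targets) for the surrogate, e.g. state -> turbulence term."""
    x = rng.normal(size=(n, 3))
    y = x @ W_TRUE + 0.01 * rng.normal(size=n)   # slightly noisy target
    return x, y

class Surrogate:
    """Tiny linear model trained by gradient descent, standing in for the GNN."""
    def __init__(self, d=3, lr=1e-2):
        self.w = np.zeros(d)
        self.lr = lr
    def train_batch(self, x, y):
        grad = 2.0 * x.T @ (x @ self.w - y) / len(y)
        self.w -= self.lr * grad
    def mse(self, x, y):
        return float(np.mean((x @ self.w - y) ** 2))

model = Surrogate()
for step in range(200):            # simulation time loop
    for patch in range(4):         # domain-decomposed patches
        x, y = solve_patch(step, patch)
        model.train_batch(x, y)    # train in situ on freshly generated data

print("learned weights:", np.round(model.w, 3))
```

The point of the sketch is the loop structure: training happens inside the simulation's time loop, on data that never needs to be written to disk.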
- There are two problem sizes: a large 3D case and a medium 2D case.
- HPE SmartSim and DRAGON are used; SmartSim uses Python multiprocessing controlled by DRAGON. There is a detailed data movement analysis covering two cases: the CFD is always domain decomposed, but there are two possibilities for inference:
- Use PyTorch/LibTorch directly to run the NekRS GNN locally, with each decomposed CFD patch linking locally to its associated GNN nodes.
- Use “centralized” SmartRedis (RedisAI) to run the inferences. Note that, as the surrogate estimates turbulence, the inference instances are naturally distributed in the same way as the CFD mesh. Also note that RedisAI does not support GNNs and will not on Aurora; an MLP is therefore used in the initial runs rather than a GNN, so that the distributed and centralized approaches can be compared.
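The key difference between the two options above is network data movement: colocated inference keeps features and predictions on the node, while a centralized server must receive features from every patch and return predictions. A back-of-envelope model makes the scaling behavior concrete; the function, patch sizes, and feature dimensions below are illustrative assumptions, not measurements from SimAI-Bench.

```python
# Rough data-movement model per inference step for the two coupling modes:
# "distributed" (LibTorch colocated with each CFD patch, no network traffic)
# vs "centralized" (features shipped to a SmartRedis-style server and
# predictions shipped back). All numbers are made-up placeholders.

def bytes_moved(n_patches, points_per_patch, feat_dim, out_dim,
                dtype_bytes=4, mode="centralized"):
    """Bytes crossing the network per inference step."""
    if mode == "distributed":
        return 0  # inference runs locally on each patch's data
    send = n_patches * points_per_patch * feat_dim * dtype_bytes  # features out
    recv = n_patches * points_per_patch * out_dim * dtype_bytes   # predictions back
    return send + recv

for n in (8, 64, 512):
    c = bytes_moved(n, points_per_patch=100_000, feat_dim=6, out_dim=1)
    print(f"{n:4d} patches: centralized moves {c / 1e9:.2f} GB/step; "
          f"distributed moves 0 GB/step")
```

Centralized traffic grows linearly with the number of patches, which is consistent with the expectation (confirmed on slide 16) that the distributed approach scales better.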
- Slide 16 shows performance results indicating that the distributed colocated PyTorch approach scales better than the centralized approach (as expected); the overhead is roughly 4% for inference and 1% for training.
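The overhead figures quoted from slide 16 follow the usual definition, relative to the simulation-only runtime; a one-line helper makes this explicit (the timings below are made-up placeholders, not numbers from the talk).

```python
# Overhead of a coupled run relative to a simulation-only baseline:
# overhead = (coupled runtime - baseline runtime) / baseline runtime.
def overhead(coupled_s, baseline_s):
    return (coupled_s - baseline_s) / baseline_s

# Placeholder timings: a 104 s coupled run vs a 100 s baseline.
print(f"{overhead(104.0, 100.0):.0%}")  # → 4%
```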
- There were 20,000 trainable weights in the GNN, so the network size is modest.
- Juri asked about a performance model and Gregor noted that inference could run in memory
- Shantenu noted that DRAGON is a process manager
- DRAGON supports a distributed Python Dictionary which is very attractive but not yet fully debugged.
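The appeal of a distributed dictionary is that workflow components can exchange data by key instead of managing point-to-point transfers. DRAGON's actual dictionary spans nodes; as a single-node stand-in, Python's standard-library `multiprocessing.Manager` dict shows the same producer/consumer pattern. Nothing below is DRAGON's API; the `simulation_rank` and `trainer` names are illustrative.

```python
# Single-node sketch of the shared-dictionary coupling pattern: several
# "simulation" ranks publish patch data under unique keys, and a "trainer"
# consumes them, all through one shared dict.
from multiprocessing import Manager, Process

def simulation_rank(shared, rank):
    # Each simulation rank publishes its patch data under a unique key.
    shared[f"patch_{rank}"] = [rank * 1.0, rank * 2.0]  # stand-in for a tensor

def trainer(shared, n_ranks, out):
    # The trainer counts how many published patches it can see.
    out["n_seen"] = sum(1 for r in range(n_ranks) if f"patch_{r}" in shared)

if __name__ == "__main__":
    with Manager() as mgr:
        shared, out = mgr.dict(), mgr.dict()
        sims = [Process(target=simulation_rank, args=(shared, r)) for r in range(4)]
        for p in sims:
            p.start()
        for p in sims:
            p.join()
        t = Process(target=trainer, args=(shared, 4, out))
        t.start()
        t.join()
        print(f"trainer saw {out['n_seen']} patches")  # trainer saw 4 patches
```

A Manager dict serializes every access through one server process, so it is only a pattern illustration; a node-spanning dictionary like DRAGON's is what makes the pattern viable at HPC scale.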
- Victor Lu suggested using profiles to understand overheads; Riccardo noted that they had not profiled the whole workflow yet.
- Riccardo expects to publish a paper when DRAGON issues have been solved.