May 1, 2024
Present
Geoffrey Fox, Juri Papay, Gregor von Laszewski, Gregg Barrett, Piotr Luszczek, Victor Lu, Tom Gibbs, Christine Kirkpatrick, Elie Alhajjar, Murali Emani, Yuhan Douglas Rao, Armstrong Foundjem
Tentative Agenda
- Introduction of Any New Members
- Status of Benchmarks
- White Papers
- Using Benchmarking Data to Inform Decisions Related to Machine Learning Resource Efficiency https://docs.google.com/document/d/1gOKA8BnlJnsTAELWFSmL7Fl7kJej_UrNH-FVXbZFxGI/edit?usp=sharing Submitted (Christine Kirkpatrick)
- Benchmark Carpentry https://docs.google.com/document/d/15YIlAWOBA2_xjXkTnAZmaw003Jh4eqURVZYQHhdGYdQ/edit#heading=h.fa0u4qc1plw5
- AI Readiness of MLCommons Science https://docs.google.com/document/d/1NbL-VdkrY9jzPxveOys2RCK8TdEJ7O5wgnxjAgzK-rE/edit?usp=sharing
- Science Foundation Models
- Any Other Business
OSMI and Related Benchmarks
- Gregor noted a meeting with ORNL and HPE working towards a paper for a Frontier submission; the paper must include AI and a workflow component, and he suggested using OSMI and MOM6 as the applications
- OSMI was discussed earlier in these minutes; MOM6 is the Modular Ocean Model (MOM) from the Geophysical Fluid Dynamics Laboratory (GitHub: mom-ocean/MOM6)
- OSMI has been converted to use PyTorch and HPE SmartSim (see the sketch after this list)
- TensorFlow on Frontier is slow, and the observation was that no one seems to use TensorFlow these days (Wes converted OSMI to PyTorch)
- The plan is to port and run OSMI and MOM6 on machines other than Rivanna, showcasing the workflow component of cloudmesh in the process
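As a rough illustration of what driving the PyTorch version of OSMI through SmartSim could look like, below is a minimal sketch using SmartSim's Experiment API; the script name, arguments, and launcher are hypothetical placeholders, not the actual OSMI configuration.

```python
from smartsim import Experiment

# Hypothetical driver: launch a PyTorch-based OSMI inference script as a
# SmartSim Model. "osmi_infer.py" and its arguments are placeholders.
exp = Experiment("osmi-benchmark", launcher="local")  # e.g. "slurm" on an HPC system

settings = exp.create_run_settings(exe="python",
                                   exe_args=["osmi_infer.py", "--batch", "100"])
model = exp.create_model("osmi-run", settings)

exp.start(model, block=True)   # run the job and wait for completion
print(exp.get_status(model))   # report the job's final status
```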
MLCommons FAIR Implementation Profile (FIP)
- Christine Kirkpatrick presented the FAIR Implementation Profile for MLCommons https://docs.google.com/presentation/d/1xaGMzvWWkp3e1beo7Lv5kqirG23lbA6sGNWcgE4z1TI/edit?usp=sharing or https://docs.google.com/presentation/d/1oZzZayHibGBEQ9641ipnyyjT0I9Orpp7/edit?usp=sharing&ouid=100475830980223085873&rtpof=true&sd=true
- This was received with great interest, and Gregg Barrett suggested that it be placed in one of the MLCommons newsletters so that the community adopts it
- There was discussion of the granularity at which FIPs should be designed; Christine indicated that the scope should be as large as possible, i.e., MLCommons rather than MLCommons Science
- MLCommons has many areas where the FIP could not be quantified, which suggests we need an improved process
- We will bring this to David Kanter’s attention
- There was no time to discuss the white papers, but this material could be included in one of them.
Inference Performance for Generative AI
- Farzana Yasmin Ahmad <fa7sa@virginia.edu> and Vanamala Venkataswamy <vv3xu@virginia.edu> have compared 32-bit and 16-bit floating-point performance on CaloDiffusion https://docs.google.com/presentation/d/1Vb0-ZiWYTkmXb7-vCoXVhkWVWYDlgpPhr8Uf9VTE-S0/edit?usp=sharing.
- The quality of the generated events is similar, but 100 events take 0.12 seconds in FP16 and 0.22 seconds in FP32, roughly a 1.8x speedup (see the timing sketch at the end of this section).
- This could mean that the gain comes from data-transfer speed rather than compute performance (Piotr)
- Some operations are memory-bandwidth-bound, meaning the on-chip memory bandwidth determines how much time is spent computing the output. Storing the operands and outputs of those ops in the bfloat16 format reduces the amount of data that must be transferred, improving speed. (Gregor)
- FP16 and BFloat16 have different performance on the A100. BFloat16 has a larger exponent range than FP16 but lower performance on the A100; see Nick Higham's post "Half Precision Arithmetic: fp16 Versus bfloat16" and "NVIDIA Ampere Architecture In-Depth" on the NVIDIA Technical Blog. (A format-comparison sketch appears at the end of this section.)
- The H100 was rated at 60 teraflops of dense FP64 compute per GPU. If the B200 had similar scaling to the other formats, each dual-die GPU would have 150 teraflops; however, it looks like Nvidia is stepping back FP64 performance a bit, with 45 teraflops of FP64 per GPU, according to "Nvidia's next-gen AI GPU is 4X faster than Hopper: Blackwell B200 GPU delivers up to 20 petaflops of compute and other massive improvements" (Tom's Hardware)
- Tom Gibbs noted:
- NSF supercomputer centers will have the H100 soon
- FP64 uses a lot of transistors, and NVIDIA cut back a bit here to put those transistors to better use
- DGEMM (double-precision general matrix-matrix multiplication) is no longer used much (see the throughput sketch at the end of this section)
- "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits"
- "Llama-Bitnet | Training a 1.58 bit LLM" by Zain ul Abideen (Medium, April 2024)
- Tom will get colleague Steve Oberlin and
- There will be a workshop at ISC in two weeks (which collides with our meeting) where Jeyan, Tom, Piotr, and Tony Hey will discuss such issues.
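To make the FP16 vs FP32 timing comparison above concrete, here is a minimal PyTorch sketch; the small convolutional model is a hypothetical stand-in for the CaloDiffusion network (not the actual benchmark code), it assumes a CUDA-capable GPU, and absolute numbers will differ by hardware.

```python
import time
import torch

# Hypothetical stand-in for the CaloDiffusion denoising network: any
# convolutional model suffices to illustrate the FP16 vs FP32 timing gap.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 1, 3, padding=1),
).cuda().eval()

batch = torch.randn(100, 1, 64, 64, device="cuda")  # "100 events"

def time_inference(dtype):
    m = model.to(dtype)
    x = batch.to(dtype)
    # Warm up so one-time CUDA kernel launches and allocations are excluded.
    with torch.no_grad():
        for _ in range(3):
            m(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        m(x)
    torch.cuda.synchronize()  # wait for the GPU before reading the clock
    return time.perf_counter() - start

print("FP32:", time_inference(torch.float32))
print("FP16:", time_inference(torch.float16))
```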
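On the FP16 vs BFloat16 point, the trade-off between exponent range and precision can be inspected directly with torch.finfo; a quick sketch:

```python
import torch

# BFloat16 keeps FP32's exponent range (max ~3.4e38) at the cost of coarser
# precision (larger eps), while FP16 has a much smaller range (max 65504)
# but finer precision.
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, "
          f"smallest normal={info.tiny:.3e}, eps={info.eps:.3e}")
```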
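Finally, on the DGEMM point: a classic way to probe achieved FP64 throughput is timing a large double-precision matrix multiply, as in the NumPy sketch below; the matrix size is arbitrary and the measured GFLOP/s is purely illustrative of the method, not a benchmark result.

```python
import time
import numpy as np

# HPL-style probe of achieved FP64 DGEMM throughput via the BLAS
# library backing NumPy.
n = 4096
a = np.random.rand(n, n)
b = np.random.rand(n, n)
a @ b  # warm up BLAS threads and caches

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flops = 2 * n**3  # multiplies and adds in an n x n x n GEMM
print(f"{flops / elapsed / 1e9:.1f} GFLOP/s FP64")
```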