February 9, 2022

Present

Tony Hey, Geoffrey Fox, Gregor von Laszewski, Hai Ah Nam, Juri Papay, Gregg Barrett, Farzana Yasmin Ahmad, Steve Farrell

Apologies: Jeyan Thiyagalingam, Arjun Shankar, Aristeidis Tsaris

Tentative Agenda

  • New member introductions (None)
  • Update on the status of MLCommons compliance, what has been achieved, and what needs to be done.
  • Portability of benchmarks. How much effort is involved in deploying a benchmark?
  • Discussion of whether these benchmarks are portable and where we intend to run them.
  • Clarification on what each benchmark is going to measure. (Not Discussed)
  • AOB

MLCommons Compliance

Portability of benchmarks

  • This topic took the rest of the session. Juri described his difficulty in running benchmarks on multiple machines, including a horror story of Horovod taking four days to install. Horovod is needed to run on multi-node (each presumably multi-GPU) systems; single-node, multi-GPU runs are supported without additional software. Juri looked at STEMDL and CloudMask.
  • Gregg Barrett noted that it sounds like we need an updated paper on the Technical Debt of Machine Learning.
  • The importance of using local disks to get good performance was stressed
  • Gregg Barrett noted the upcoming MLCommons Storage benchmarks being developed by the Storage Working Group (see the MLCommons Storage Working Group Description).
  • Steve Farrell pointed out that submitters used containers (NERSC uses Shifter) and often partnered with NVIDIA, who knew how to install all this software.
  • Containers are not typically reusable across systems except by rebuilding from their specification, as different systems support different software versions.
  • Security restrictions make the effort harder
  • MLCube was discussed but didn’t seem to address this issue
  • JAX, PyTorch, Horovod, and Mesh TensorFlow were highlighted
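The pattern Steve described, containers plus a distributed launcher, can be sketched as a Slurm batch script. Everything below (image name, script and data paths, node counts) is hypothetical and only illustrates the shape of running a Horovod job inside a Shifter container at a NERSC-style site; it is not any member's actual setup.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4                               # one MPI rank per GPU
#SBATCH --gpus-per-node=4
#SBATCH --image=nvcr.io/nvidia/tensorflow:22.01-tf2-py3   # hypothetical container image

# Stage the dataset to node-local disk first (the local-disk point above);
# $SCRATCH and both paths are placeholders.
srun --ntasks-per-node=1 cp -r $SCRATCH/cloudmask_data /tmp/data

# Launch one Horovod rank per GPU inside the Shifter container;
# train_cloudmask.py is a hypothetical training script.
srun shifter python train_cloudmask.py --data /tmp/data
```

Because the framework stack lives in the container, this sidesteps the multi-day Horovod build Juri described, at the cost of the container reusability and security issues noted above.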