January 22, 2025

Present

Andrew Naylor, Andy Cheng, Armstrong Foundjem, Azza Ahmad, Ben Hawks, Briana Cervantes, Christine Kirkpatrick, Claus Weiland, Datta Nimmaturi, David Kanter, Gary Mazzaferro, Geoffrey Fox, Gregg Barrett, Gregor von Laszewski, Gyuri Papay, Howard Pritchard, Hussain Ather, Javier Toledo, Jennifer Ngadiuba, Kevin Coakley, Lee S, Marco Colombo, Marisa Ahmad, Matt Sinclair, Mia Liu, Murali Emani, Nhan Tran, Piotr Luszczek, Riccardo Balin, Rutwik Jain, Ryan Kastner, Scott Brown, Shirley Moore, Shivaram Venkataraman, Victor Lu, Vijay Janapa Reddi, Wes Brewer

Apologies

Philip Harris

Tentative Agenda

New Members

Presentation

  • Rutwik Jain gave an excellent presentation of the talk SC22_PM_Variability_v6.pdf, which clearly laid out findings on GPU performance variability.
  • The issue of GPU and CPU performance variability had already been discussed in the context of our Benchmark Carpentry white paper under preparation, and Matt Sinclair had agreed to contribute a section of the white paper on this topic.
  • In the discussion it was noted that memory-intensive problems showed less variability than compute-bound applications. Also, because a parallel job runs at the speed of its slowest processor, the impact on multi-GPU jobs is particularly significant.
  • It was noted that this work allowed TACC to identify and replace slow GPUs. The Wisconsin team also designed a scheduler that assigns the slow GPUs to the tasks where their low performance matters least.
  • Sinclair has been interacting with the Power working group of MLCommons.
  • SC24 paper https://arxiv.org/pdf/2408.11919 “PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters” presents a model and a scheduler.
  • Gregor stressed the need to look in detail at both the rack and the position within the rack of each GPU. He also noted that multi-user environments could be important even when GPUs are not shared.
  • Gregor noted his well-cited papers “Power-Aware Scheduling of Virtual Machines in DVFS-Enabled Clusters” and “Towards Thermal-Aware Workload Scheduling in a Data Center.”
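The variability-aware placement idea from the Wisconsin team (slow GPUs go to the jobs least sensitive to GPU speed) can be sketched in a few lines. This is a minimal illustration, not the PAL implementation: the speed and sensitivity scores, function name, and greedy pairing are all hypothetical.

```python
# Hypothetical sketch of variability-aware placement: pair the slowest
# GPUs with the jobs whose runtime depends least on GPU speed.
# Speeds and sensitivities below are illustrative, not measured values.

def variability_aware_assign(gpu_speeds, job_sensitivities):
    """Greedy pairing of jobs to GPUs.

    gpu_speeds:        {gpu_id: relative speed, 1.0 = median GPU}
    job_sensitivities: {job_id: how strongly runtime scales with GPU speed}
    """
    gpus = sorted(gpu_speeds, key=gpu_speeds.get)              # slowest first
    jobs = sorted(job_sensitivities, key=job_sensitivities.get)  # least sensitive first
    return dict(zip(jobs, gpus))

# gpu1 plays the role of a ~1.5x-slower outlier (relative speed 0.67).
gpus = {"gpu0": 1.00, "gpu1": 0.67, "gpu2": 0.95}
jobs = {"memory_bound": 0.2, "compute_bound": 0.9, "mixed": 0.5}
print(variability_aware_assign(gpus, jobs))
# The memory-bound job lands on the slow outlier; the compute-bound job
# gets the fastest GPU.
```

A real scheduler (such as PAL in the SC24 paper) must also handle multi-GPU jobs, queueing, and changing cluster state; this sketch only shows the core pairing intuition.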
The following weren’t covered in the talk, but are in the paper:
  • Variability with cluster scale
  • Variability with different GPU vendors (AMD/NVIDIA)
  • Effect of varying GPU power limit
  • Comparing single-GPU ResNet vs. multi-GPU ResNet
1. How much performance variation is there across GPUs? 9% for SGEMM, with outliers 1.5x slower than the median.
2. Do GPU physical metrics vary too? Yes, they also vary.
3. How is variability affected by cluster parameters? Performance variability is consistent across clusters.
4. Is variability consistent over time? Yes.
5. Is variability application-dependent? Yes; compute-intensive applications see more performance variability than memory-intensive ones.

This table summarizes the presentation (slide 60).