February 23, 2022
Present
Tony Hey, Geoffrey Fox, Gregor von Laszewski, Juri Papay, Gregg Barrett, Farzana Yasmin Ahmad, Arjun Shankar, Aristeidis Tsaris, Junqi Yin, Murali Emani, Cade Brown, Piotr Luszczek
Apologies: Jeyan Thiyagalingam, Christine Kirkpatrick
Tentative Agenda
- New member introductions
- Special Presentation "FAIR Metadata and Launching Platform for Surrogates" by Cade Brown and Piotr Luszczek, University of Tennessee
- Continuation of discussion of portability of benchmarks. How much effort is involved in deploying a benchmark?
- Clarification on what each benchmark is going to measure. (NOT DONE)
- AOB
New Member Introductions
Cade Brown and Piotr Luszczek from the University of Tennessee introduced themselves. They work in the Innovative Computing Laboratory, led by Jack Dongarra, at the University of Tennessee. Cade Brown is a junior at UTK.
Presentation on FAIR Metadata and Launching Platform for Surrogates
- This work is part of the Surrogate Benchmark Initiative (SBI-FAIR), discussed at its February 14, 2022 meeting, involving UTK, Argonne National Laboratory, Rutgers, Indiana University, and the University of Virginia.
- Cade Brown gave a wonderful presentation (CadeBrown-FAIR-Platform-Surrogates) with the following goals:
- Public website for browsing ML models, datasets, and performance runs
- Quick and easy installation instructions for models:
- Search the website for the model, based on keywords, authors, etc.
- Click the “install” button, which downloads the .sh or .py file
- Run MODEL.py setup/run/bench
- Ideally, should work everywhere (HPC/supercomputer, desktop, laptop, CUDA/HIP/...), but some models may be framework-specific
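The "MODEL.py setup/run/bench" flow above could be sketched as a small command-line entry point. This is purely illustrative: the three subcommand names come from the presentation, but every function body here is a hypothetical placeholder, not the project's actual implementation.

```python
# Hypothetical sketch of a MODEL.py launcher with setup/run/bench
# subcommands; all bodies are illustrative placeholders.
import argparse


def setup():
    """Fetch datasets and install model dependencies (placeholder)."""
    return "setup complete"


def run():
    """Execute the model end to end (placeholder)."""
    return "run complete"


def bench():
    """Run the model and report timing metrics (placeholder)."""
    return "bench complete"


def main(argv=None):
    parser = argparse.ArgumentParser(description="Surrogate model launcher")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("setup", help="fetch data and dependencies")
    sub.add_parser("run", help="run the model")
    sub.add_parser("bench", help="benchmark the model")
    args = parser.parse_args(argv)
    # Dispatch to the selected subcommand.
    return {"setup": setup, "run": run, "bench": bench}[args.command]()
```

A user who downloaded the script would then invoke it as, e.g., `python MODEL.py setup` followed by `python MODEL.py bench`.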
- Cade noted related links
- https://neptune.ai/ - paid service with similar goals
- https://modelzoo.co/ - aggregator of ML models, good to datamine
- https://pypi.org/project/python-firebase/
- https://firebase.google.com/docs/ml/manage-hosted-models
- https://docs.google.com/document/d/1Aw3cYBEEf34A7A_UbDAOVVSjF28oM4C4o5RGfGRjDXI
- Cade noted they looked at CloudMask and STEMDL in our benchmarks
- Cade described the choice of JSON with perhaps HDF5 used for large datasets for efficiency.
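The JSON-plus-HDF5 split described above might look like the following sketch: small, searchable metadata lives in JSON, while bulk arrays are stored in a separate HDF5 file and referenced by path. All field names and values here are assumptions for illustration, not the project's actual schema.

```python
# Illustrative metadata record: lightweight fields in JSON, with a
# large dataset kept outside the JSON in an HDF5 file referenced
# by path. Field names and values are hypothetical.
import json

record = {
    "model": "CloudMask",          # one of the benchmarks mentioned in the meeting
    "framework": "tensorflow",
    "metrics": {"epoch_time_s": 340.5},
    "dataset": {
        "format": "hdf5",
        "path": "data/cloudmask_train.h5",  # bulk data stays out of the JSON
    },
}

serialized = json.dumps(record, indent=2)   # what the platform would store/serve
restored = json.loads(serialized)           # round-trips losslessly
```

Keeping only scalar metadata in the JSON keeps records cheap to index and search, while HDF5 handles the large arrays efficiently.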
- Cade discussed tradeoffs in the use of Firebase – discussion prompted by earlier SBI project meeting.
- Cade noted the use of TensorBoard to visualize weights and loss as part of the reporting capabilities
- Cade asked for user input so they can adapt the platform
- Gregg gave a link to Christine’s presentation https://docs.google.com/presentation/d/1iYOLIYBOBF0HraZaWfIyZbePuoQs8lAQ_6-uVwRZZKw/edit#slide=id.g10c9d08e4bc_1_138
- Murali: This is really interesting work. We have an ongoing project focused on HPC ML performance datasets and models: https://hpc-fair.github.io/
- Gregor noted his work on subtle issues, such as clock-speed changes and overclocking distorting performance measurements. He has useful tools for this.
- Cade and Piotr noted the use of hardware performance counters to specify base system performance
- Arjun and Gregg congratulated the first users of our benchmarks. Arjun liked the separation of models and datasets and recommended URLs as universal names.
- Gregor noted work of fast.ai
- SciMLBench SciMLBench_MLCommons_Science.pdf and DLHub were suggested as related work
Discussion of portability of benchmarks
- This continued the theme of the previous topic, and we agreed that we must specify two separate artifacts:
- FAIR metadata links to performance results, data, and reference models
- A set of containers supporting execution. These containers will have the same or similar specification files but different instances on each hardware platform, due to version inconsistencies and differences between what each system supports in containers and what it has installed natively. These containers will support PyTorch, TensorFlow, Horovod, etc.
- Note that Horovod tends to have the best performance due to its use of MPI AllReduce
- Gregor noted, for example, that compiled Python ran about 10% faster than the standard interpreter at the University of Virginia
- Porting between containers can be performed on a machine that supports multiple versions (Docker, Singularity, Shifter)
- Target hardware could include ThetaGPU (Argonne), Summit (ORNL), Pearl (RAL), AWS (supports Kubernetes), TACC/SDSC, and NERSC
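One common pattern for the per-machine container instances discussed above is a single Singularity/Apptainer definition file, bootstrapped from a Docker image and rebuilt on each target with machine-specific versions pinned in one place. The base image and packages below are illustrative assumptions, not an agreed specification:

```
# example.def -- illustrative Singularity/Apptainer definition file.
# Base image and package list are assumptions; each target system
# would pin its own compatible versions here.
Bootstrap: docker
From: python:3.9

%post
    # Machine-specific framework versions would be pinned here
    pip install torch tensorflow

%runscript
    exec python "$@"
```

It would be built with `singularity build example.sif example.def`; an existing Docker image can also be converted directly via `singularity build image.sif docker://IMAGE`, which is one way to perform the cross-runtime porting mentioned above.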