June 26, 2024
Present
Geoffrey Fox, Juri Papay, Gregor von Laszewski, Gregg Barrett, Wes Brewer, Victor Lu, Hector Hernandez Corzo, Javier Toledo, Armstrong Foundjem, Tom Gibbs, Ali Hashmi
Apologies
Christine Kirkpatrick
Tentative Agenda
- Any New Members Introduction
- Preparing for Community Meeting June 27, 2024. See https://docs.google.com/presentation/d/1mmtVqYEpzVwsC1GjKXEGJckBnFPax-H2inD4tRMkS_w/edit?usp=sharing and full set https://docs.google.com/presentation/d/107NxlXHMH2NKi7akmfLO9n6fG-wknhNKW2N9EPdtR3k/edit?usp=sharing
- Status of Papers
- Status of Benchmarks
- Science Foundation Models
- Any Other Business
New Member
- Ali Hashmi https://www.linkedin.com/in/ali-hashmi/ Senior Programmer/Healthcare Data Scientist at IBM Consulting - US Federal
Status of Benchmarks
- We started discussing OSMI-Bench, where our work is led by Wes and Gregor. There are deployment choices between SmartSim (see the December 13, 2023 minutes and https://github.com/CrayLabs/SmartSim), SimAI-Bench from Argonne (presented at PASC24: PASC24_presentation.pdf), and the Cloudmesh Experiment Executor (from Gregor)
- Difficulties with TensorFlow Serving
- Deployed with Docker or Singularity
- Hope to complete before the next meeting
- Wes suggested bringing up Frontier deployment difficulties at the Users meeting e.g. that each user must install PyTorch
- Wes noted the AI-coupled HPC Workflow Applications, Middleware and Performance paper with Shantenu Jha. He gave a related presentation on June 27 (OSMI-Bench Brewer.pdf)
- Gregor, Wes, and Juri discussed how best to package benchmarks
- Each application needs own directory/environment
- Customize environment for each benchmark
- HPE software is customized to their machine
- Difficult to avoid software version clashes. The DGX A100 workstation used by Javier is easier as more dedicated
- Systems cannot install across heterogeneous targets needing different drivers
- NVIDIA can help here
- Juri is running the OLMo language model on Frontier
- He is also running a vision transformer for weather
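The TensorFlow Serving deployment difficulties noted above can be illustrated with a minimal container-based sketch using the stock tensorflow/serving image; the model name "osmi" and the local model path are illustrative placeholders, not details from the minutes:

```shell
# Hedged sketch: serve a SavedModel with the stock tensorflow/serving image.
# "osmi" and the models/osmi path are illustrative placeholders.
docker run -d --rm -p 8501:8501 \
  --mount type=bind,source="$PWD/models/osmi",target=/models/osmi \
  -e MODEL_NAME=osmi \
  tensorflow/serving

# On HPC systems without Docker (e.g. via Singularity/Apptainer),
# the same image can be pulled and converted:
singularity pull tf-serving.sif docker://tensorflow/serving
```

With the container running, predictions would go to the REST endpoint at localhost:8501/v1/models/osmi:predict (port 8500 carries the gRPC interface).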
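The packaging discussion above (each application needing its own directory and environment to avoid version clashes) can be sketched as one isolated virtual environment per benchmark, assuming an illustrative benchmarks/<name>/requirements.txt layout:

```shell
# Hedged sketch: one isolated Python environment per benchmark directory,
# assuming an illustrative layout of benchmarks/<name>/requirements.txt.
for bench in benchmarks/*/; do
  python3 -m venv "$bench/.venv"                        # dedicated env per benchmark
  "$bench/.venv/bin/pip" install -r "$bench/requirements.txt"
done
```

Keeping each benchmark's dependencies in its own environment avoids the version clashes mentioned above, at the cost of duplicated packages on disk.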
Any Other Business
- Geoffrey noted that his group was looking at Hernandez’s system RWKV-TS for Hydrology time-series
- Javier wanted to know where MLCommons benchmarks were documented
- https://mlcommons.org/working-groups/research/science/
- https://github.com/mlcommons/science
- https://github.com/laszewsk/mlcommons
- We asked about H100 access; Tom Gibbs suggested TACC might be best
- NVIDIA LaunchPad might also be possible
- Juri gave a short presentation on flop counts for applications new_benchmarks_counting_flops.pptx
- The A100 performs very well
- He asked how NVIDIA got their FLOP numbers; Tom Gibbs thought they were optimistic “never to exceed” figures
- He compared the AMD MI250 with the A100, but there were some difficulties unless one uses all GPUs on the node
- He was having problems using NVIDIA Grace-Hopper machines
- Wes noted INDUS (Birgit Pfitzmann et al., [2405.10725] INDUS: Effective and Efficient Language Models for Scientific Applications), mainly from NASA and IBM, featuring a scientific LLM and benchmarks
- "We also created three new scientific benchmark datasets namely, CLIMATE-CHANGE NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR) to accelerate research in these multi-disciplinary fields."
- NASA is operationalizing the INDUS LLM across its Science Mission Directorate
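Juri's comparison of measured versus advertised FLOP rates can be sketched as a quick efficiency calculation. The model FLOP count and wall time below are hypothetical placeholders; 312 TFLOPS is NVIDIA's published A100 dense FP16 Tensor Core peak, the kind of "never to exceed" figure Tom Gibbs described:

```shell
# Hedged sketch: compare an achieved FLOP rate against a datasheet peak.
# flops (total model FLOPs) and secs (wall time) are hypothetical values;
# 312 TFLOPS is the A100 dense FP16 Tensor Core datasheet peak.
awk -v flops=6.0e15 -v secs=30 -v peak=312 'BEGIN {
  achieved = flops / secs / 1e12       # FLOPs per second, converted to TFLOPS
  printf "achieved: %.1f TFLOPS (%.0f%% of peak)\n", achieved, 100 * achieved / peak
}'
# → achieved: 200.0 TFLOPS (64% of peak)
```

Reported utilization depends heavily on which peak one divides by (FP64, FP32, TF32, or FP16 Tensor Core), which is one reason vendor FLOP comparisons are hard to interpret.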