March 6, 2024
Present
Geoffrey Fox, Juri (Gyuri) Papay, Feiyi Wang, Elie Alhajjar, Gregor von Laszewski, Gregg Barrett, Murali Emani
Apologies
Xavier Coubez, Christine Kirkpatrick, Vijay Janapa Reddi, Tom Gibbs, Rajat Shinde
Tentative Agenda
- Introduction of any new members
- The AI Alliance and its implications
- Science Foundation Models
- Updates on Papers and Projects
- Any Other Business
AI Alliance
- Geoffrey described interactions with the AI Alliance (https://thealliance.ai/), which has a Foundation Models working group. It is not yet clear whether this group will focus on LLMs or also cover science data.
- The Alliance is led by IBM and Meta and has many industry, academic, and government members; Google, Amazon, Microsoft, NVIDIA, and OpenAI are notably absent. MLCommons is a member.
- The whole Alliance held an All Hands meeting on Tuesday, March 5. There were no surprises and the comments were fairly general. Geoffrey will keep in touch and monitor developments.
Science Foundation Models
- Juri described the porting of OLMo, the Open Language Model from the Allen Institute for AI (AI2), and Feiyi was interested in its performance.
- Murali reported that Llama was running well on Polaris at the Argonne Leadership Computing Facility (NVIDIA A100 GPUs), compared with the Intel GPUs on Aurora, which use the Intel oneAPI framework (see the backend sketch after this list).
- Feiyi reported on the FORGE model ("FORGE: Pre-Training Open Foundation Models for Science", SC '23), studied on Frontier in "Optimizing Distributed Training on Frontier for Large Language Models" (arXiv:2312.12705). The latter work is improved on in a paper accepted to ISC.
- They noted that performance was good up to about 3,000 GPUs, where scaling stalls; problems occur for models larger than 100 billion parameters (see the back-of-envelope sketch after this list).
- Murali discussed "MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs" (arXiv:2402.15627) from ByteDance.
- Argonne and Oak Ridge are using Megatron-DeepSpeed, Microsoft's DeepSpeed-enabled fork of NVIDIA's Megatron-LM, for pretraining: https://github.com/microsoft/Megatron-DeepSpeed
- The science pretraining does not by itself support chat in the style of GPT, as that requires specialized fine-tuning (see the fine-tuning sketch after this list).
- Argonne's AuroraGPT is believed to be training on scientific literature together with data such as gene sequences and climate records, which are combined to increase accuracy.
- Feiyi suggested that Oak Ridge could give a talk after the papers had been prepared and submitted to SC.
- There was a discussion of Summit and Crusher at the Oak Ridge Leadership Computing Facility; Crusher is a "small" Frontier. In Juri's experience the software is more mature on Summit.
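
The following is a minimal sketch of the Polaris/Aurora portability point above, not code from the talks: the same PyTorch model code can target A100 nodes via CUDA or Intel GPUs via the oneAPI "xpu" backend. The Intel Extension for PyTorch import is an assumption about the software stack.

```python
import torch

try:
    # Assumption: Intel Extension for PyTorch provides the oneAPI "xpu" device.
    import intel_extension_for_pytorch as ipex
except ImportError:
    ipex = None

if torch.cuda.is_available():  # Polaris-style NVIDIA A100 nodes
    device = torch.device("cuda")
elif ipex is not None and hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")  # Aurora-style Intel GPUs via oneAPI
else:
    device = torch.device("cpu")

# Identical model code runs on either backend once the device is chosen.
model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(8, 4096, device=device)
print(device, model(x).shape)
```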
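
A back-of-envelope illustration (ours, not a figure from the talks) of why models beyond roughly 100 billion parameters are hard: mixed-precision Adam typically keeps about 16 bytes of state per parameter, so the full training state far exceeds one GPU's memory and must be sharded, and the resulting communication plausibly contributes to the scaling stall. The 64 GB figure assumes one Frontier MI250X GCD.

```python
import math

# Common mixed-precision Adam estimate: fp16 weights (2) + fp16 grads (2)
# + fp32 master weights, momentum, and variance (4 + 4 + 4) = 16 bytes/param.
BYTES_PER_PARAM = 16

def state_tb(params_billion):
    """Total training state in TB for a model of the given size."""
    return params_billion * 1e9 * BYTES_PER_PARAM / 1e12

def gpus_to_hold_state(params_billion, hbm_gb=64):  # one MI250X GCD (assumed)
    """Minimum GPUs needed just to hold fully sharded (ZeRO-3-style) state."""
    return math.ceil(state_tb(params_billion) * 1000 / hbm_gb)

for p in (13, 70, 100, 175):
    print(f"{p:>3}B params: ~{state_tb(p):.2f} TB state; "
          f">= {gpus_to_hold_state(p)} GPUs to hold it sharded")
```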
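
To illustrate the fine-tuning point, here is a hedged sketch, not the labs' pipeline: chat behaviour comes from supervised instruction fine-tuning on top of a pretrained model. The base model ("gpt2"), the instruction template, and the toy dataset are placeholders for illustration.

```python
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder base model
tok.pad_token = tok.eos_token                # gpt2 defines no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy instruction/response pairs; a real run uses a large curated dataset.
pairs = [("What is a transformer?",
          "A neural network architecture built around attention.")]
texts = [f"### Instruction:\n{q}\n### Response:\n{a}" for q, a in pairs]

enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
enc["labels"] = enc["input_ids"].clone()     # causal language-modeling objective

class SFTData(Dataset):
    def __len__(self):
        return enc["input_ids"].shape[0]
    def __getitem__(self, i):
        return {k: v[i] for k, v in enc.items()}

args = TrainingArguments(output_dir="sft-out", num_train_epochs=1,
                         per_device_train_batch_size=1, report_to=[])
Trainer(model=model, args=args, train_dataset=SFTData()).train()
```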
HPC Working Group
- Murali described recent changes in the MLCommons HPC working group
- They are definitely switching to a rolling submission with a leaderboard
- They might merge with the MLCommons Training benchmarks