
May 29, 2024

Present

Geoffrey Fox, Juri Papay, Gregor von Laszewski, Gregg Barrett, Christine Kirkpatrick, Wes Brewer, Hector Hernandez Corzo, Sujata Goswami, Armstrong Foundjem, Piotr Luszczek, Jeyan Thiyagalingam, Yuhan Rao, Victor Lu, Farzana Yasmin Ahmad, M B, A. Hashmi (the last two are new members, but there was no time for their introductions)

Tentative Agenda

  • Hector Hernandez Corzo Special seminar
  • Any New Member Introductions (no time for sections other than the special seminar)
  • Status of Papers
  • Status of Benchmarks
  • Science Foundation Models
  • Any Other Business

Special Seminar: Wednesday, May 29, 11:05 pm Eastern

Comments on Seminar

  • The speaker was Hector H. Corzo from the National Center for Computational Sciences at Oak Ridge National Laboratory, with an audience of 16.
  • Hector started by describing the historical background of Neural Networks for sequences with the work of M.I. Jordan and J. Elman. Analogies to the Brain were stressed.
  • Transition to transformers, which avoid the parallelization and memory issues of recurrent neural networks (RNNs), culminating in OpenAI's successes with GPT and ChatGPT. This began a huge emphasis on LLMs by major players.
  • Clear explanation of how NNs weight sequences for time series and languages.
  • Explained how distances between tokens are defined once they are embedded in a vector space
  • Describes the QKV (Query, Key, Value) concept and multiple heads in transformers; excellent parallelization (a minimal illustrative sketch appears after these notes)
  • Slide 23 compares RNN and Transformer weighting and then the structure of Transformer is described.
  • However, slide 32 starts a description of the new RWKV Recurrent Neural Network, which is parallelizable. This is applied to language models, with slide 41 comparing the computation and memory needed for Transformers and RWKV.
  • RWKV uses vectors rather than matrices (a simplified recurrence sketch appears after these notes)
  • Slide 44 shows the architecture
  • Slide 42 starts the discussion of RWKV energy needs, which are lower than for traditional transformers
  • Slide 45 starts discussion of the natural way RWKV can grow models and its relation to the concept of Neuroplasticity in the Brain
  • Blue Jay (3B) grows to Tlanuwa (7B) to Quetzal (9B)
  • This growth feature makes the model intrinsically scalable (Gregg comment)
  • The importance of cleaning data is emphasized (slide 48, with reference to Dolma)
  • The support of multiple languages and the relation to Piaget’s analysis of how children learn is discussed in slides 49-53
  • Slides 54-55 benchmark energy use and linguistic performance for Tlanuwa compared to other models like INCITE and Llama
  • The 9B model took 2000 node hours on Frontier
  • Currently just an initial model without fine-tuning
  • Slide 56 returns to the RNN vs. Transformer comparison
  • In the discussion it was noted that some tasks favor RWKV and some favor Transformers
  • There was a lengthy discussion recorded on the video above
  • There is a paper cited above looking at RWKV for time series
  • Hector recommended starting with broad capabilities and then narrowing down, rather than the other way round
  • Hector, Juri, and Jeyan will look at the code, which is open, and include it in their benchmarking studies
  • It is trained on Frontier (AMD) but runs faster on NVIDIA systems
  • Questions were asked about the comparison with the ORNL Forge project
  • Hector is working with the ML.ENERGY group at Michigan (ml.energy)
  • Hector is working with the Linux Foundation on data cleaning; the recording includes his discussion with Christine on this
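
Illustrative Sketches

Below is a minimal NumPy sketch of the QKV (Query, Key, Value) and multi-head attention ideas summarized above. The function names, shapes, and random weights are assumptions made for illustration; this is not code from the seminar.

```python
# Minimal sketch (illustrative assumptions, not the speaker's code) of
# scaled dot-product attention with multiple heads.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each query is compared with every key; the resulting weights mix the
    values. All positions are handled at once, which is why transformers
    parallelize so well."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (heads, seq, seq)
    return softmax(scores, axis=-1) @ V              # (heads, seq, d_head)

seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads
x = np.random.randn(seq_len, d_model)                # hypothetical token embeddings

# One projection per head for Q, K, V (random weights stand in for learned ones).
Wq, Wk, Wv = (np.random.randn(n_heads, d_model, d_head) for _ in range(3))
Q, K, V = (x @ W for W in (Wq, Wk, Wv))              # each (heads, seq, d_head)

heads = attention(Q, K, V)                           # (heads, seq, d_head)
output = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
print(output.shape)                                  # (6, 16)
```

The seq x seq score matrix computed here is what drives the growth of memory use with sequence length that the seminar contrasted with RWKV.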
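
For contrast, here is a deliberately simplified sketch of an RWKV-style time-mixing recurrence, in the spirit of the vectors-rather-than-matrices point above. The per-channel decay and exponential key weights follow the published RWKV idea, but the omission of the receptance gate and bonus term, and all names and values here, are simplifying assumptions rather than the model presented in the seminar.

```python
# Simplified, assumption-laden sketch of an RWKV-style recurrence: the state is
# a pair of running vectors, so memory per step is O(d) regardless of sequence
# length (unlike the seq x seq attention matrix above).
import numpy as np

def rwkv_time_mix(k, v, w):
    """k, v: (seq, d) keys and values; w: (d,) positive per-channel decay rates.
    Returns exponentially weighted averages of past values, one row per step."""
    seq_len, d = k.shape
    num = np.zeros(d)            # running weighted sum of values
    den = np.zeros(d)            # running sum of weights
    decay = np.exp(-w)           # decay applied once per time step to the past
    out = np.empty((seq_len, d))
    for t in range(seq_len):
        weight = np.exp(k[t])
        num = decay * num + weight * v[t]
        den = decay * den + weight
        out[t] = num / den
    return out

seq_len, d = 8, 16
k, v = np.random.randn(seq_len, d), np.random.randn(seq_len, d)
w = np.ones(d) * 0.5             # hypothetical decay rates
print(rwkv_time_mix(k, v, w).shape)   # (8, 16), computed with constant-size state
```

Keeping only two length-d state vectors is the source of the memory and energy advantages over full attention that were discussed around slides 41-42.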