May 29, 2024
Present
Geoffrey Fox, Juri Papay, Gregor von Laszewski, Gregg Barrett, Christine Kirkpatrick, Wes Brewer, Hector Hernandez Corzo, Sujata Goswami, Armstrong Foundjem, Piotr Luszczek, Jeyan Thiyagalingam, Yuhan Rao, Victor Lu, Farzana Yasmin Ahmad, M B, A. Hashmi (the last two are new members, but there was no time for their introductions)
Tentative Agenda
- Hector Hernandez Corzo Special seminar
- Any New Member Introductions (no time for sections other than the special seminar)
- Status of Papers
- Status of Benchmarks
- Science Foundation Models
- Any Other Business
Special Seminar: Wednesday, May 29, 11:05 am Eastern
- Speaker: Hector Hernandez Corzo, Oak Ridge National Laboratory (DOE)
- Title: Is attention all that we need?
- Abstract: In this short presentation, I will start with an overview of the Transformer architecture, highlighting its key features and how it differs from its predecessors, the Recurrent Neural Networks (RNNs). We will explore the practicalities and applications of Transformer models, examining both their advantages and disadvantages. A key part of our discussion will critically assess the validity of the statement that "Attention is all we need." Furthermore, I will introduce an innovative attention-less RNN architecture and share insights from the models I have developed using this attention-less architecture, proposing it as a formidable alternative to Transformers. This session aims to provide a comparative analysis of these architectures, enabling the audience to critically evaluate the role and necessity of attention mechanisms in the evolution of modern AI technologies.
- Recorded presentation (starts at slide 6): Science Working Group (2024-05-29 08_12 GMT-7).mp4
- Presentation slides: HHC-May29.pdf
- RWKV-TS: Beyond Traditional Recurrent Neural Network for Time Series Tasks, Haowen Hou and F. Richard Yu, arXiv:2401.09093 (2401.09093v1.pdf)
- Dialect prejudice predicts AI decisions about people's character, employability, and criminality, Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King, arXiv:2403.00742 (2403.00742v1.pdf)
- Large language models propagate race-based medicine, Jesutofunmi A. Omiye, Jenna C. Lester, Simon Spichak, Veronica Rotemberg, and Roxana Daneshjou (s41746-023-00939-z.pdf)
- Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research, Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo, arXiv:2402.00159 (2402.00159v1.pdf)
- An Attention Free Transformer, Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind, arXiv:2105.14103
- Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model, Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat
- Patterns and networks of language control in bilingual language production, Qiming Yuan, Junjie Wu, Man Zhang, Zhaoqi Zhang, Mo Chen, Guosheng Ding, Chunming Lu, and Taomei Guo, Brain Structure and Function
- Bilingualism and domain-general cognitive functions from a neural perspective: A systematic review, Lily Tao, Gongting Wang, Miaomiao Zhu, and Qing Cai (ScienceDirect)
- Consequences of multilingualism for neural architecture, Sayuri Hayakawa and Viorica Marian, Behavioral and Brain Functions
- Abnormal wiring of the connectome in adults with high-functioning autism spectrum disorder, Ulrika Roine, Timo Roine, Juha Salmi, Taina Nieminen-von Wendt, Pekka Tani, Sami Leppämäki, Pertti Rintahaka, Karen Caeyenberghs, Alexander Leemans, and Mikko Sams
Comments on Seminar
- The speaker was Hector H. Corzo from the National Center for Computational Sciences at Oak Ridge National Laboratory, with an audience of 16.
- Hector started by describing the historical background of neural networks for sequences, beginning with the work of M. I. Jordan and J. Elman. Analogies to the brain were stressed.
- He then described the transition to Transformers, which addressed the parallelization and memory limitations of recurrent neural networks (RNNs), culminating in OpenAI's successes with GPT and ChatGPT. This began the huge emphasis on LLMs by the major players.
- Clear explanation of how NNs weight sequences for time series and languages.
- He explained how distances between tokens are measured in the embedding vector space
- Describes the QKV (Query, Key, Value) concept and multiple attention heads in Transformers, which give excellent parallelization (a minimal sketch of QKV attention appears after these notes)
- Slide 23 compares RNN and Transformer weighting, and the structure of the Transformer is then described.
- Slide 32 then begins the description of the new RWKV recurrent neural network, which is parallelizable. This is applied to language models, with slide 41 comparing the computation and memory needed for Transformers and RWKV.
- RWKV uses vector (elementwise) operations rather than attention matrices (a sketch of the RWKV recurrence appears after these notes)
- Slide 44 shows the architecture
- Slide 42 starts the discussion of RWKV energy needs, which are lower than those of traditional Transformers
- Slide 45 starts the discussion of the natural way RWKV can grow models and its relation to the concept of neuroplasticity in the brain
- Blue Jay (3B) grows to Tlanuwa (7B) to Quetzal (9B)
- This growth feature makes the model intrinsically scalable (Gregg comment)
- The importance of cleaning data is emphasized (slide 48), using the Dolma corpus cited above
- The support of multiple languages and the relation to Piaget's analysis of how children learn is discussed in slides 49-53
- Slides 54-55 benchmark energy use and linguistic performance for Tlanuwa compared to other models like INCITE and Llama
- The 9B model took 2000 node hours on Frontier
- Currently just an initial model without fine-tuning
- Slide 56 returns to the RNN vs. Transformer comparison
- In the discussion it was noted that some tasks favor RWKV and some favor Transformers
- There was a lengthy discussion recorded on the video above
- A paper cited above (RWKV-TS) looks at RWKV for time series
- Hector recommended starting with broad capabilities and then narrowing down, rather than the other way round
- Hector, Juri, and Jeyan will look at the code, which is open, and include it in their benchmarking studies
- It is trained on Frontier (AMD) but runs faster on NVIDIA systems
- Questions were asked about the comparison with the ORNL Forge project
- Hector is working with the ML.ENERGY group at Michigan (ml.energy)
- Hector is working with the Linux Foundation on data cleaning; the recording includes his discussion with Christine on this
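For reference, below is a minimal NumPy sketch of the scaled dot-product QKV attention referred to in the notes above. It follows the standard Transformer formulation rather than anything specific to the speaker's slides; the function name, shapes, and toy data are illustrative assumptions.

```python
# Minimal sketch of scaled dot-product (QKV) attention in NumPy.
# Standard formulation only; names, shapes, and data are illustrative assumptions.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted sum of value vectors

# Toy usage: 4 tokens embedded in 8 dimensions, linearly projected to Q, K, V.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                            # token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                                       # (4, 8)
```

Because every token attends to every other token, the score matrix grows quadratically with sequence length; a multi-head version simply runs several such projections in parallel and concatenates the results, which is the source of the excellent parallelization noted above.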
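For comparison, here is a minimal, numerically naive sketch of the RWKV "WKV" recurrence, following the published RWKV formulation (see the RWKV-TS paper cited above). It is not the speaker's code; the production kernels use a numerically stabilized variant, and the names and toy data here are illustrative assumptions.

```python
# Minimal, numerically naive sketch of the RWKV "WKV" recurrence in NumPy.
# Follows the published RWKV formulation; not the speaker's code, and the real
# kernels use a numerically stabilized form. Names and data are illustrative.
import numpy as np

def wkv_recurrence(k, v, w, u):
    """k, v: (T, d) keys and values; w: (d,) per-channel decay >= 0; u: (d,) current-token bonus."""
    T, d = k.shape
    a = np.zeros(d)                                # running decayed sum of weighted values
    b = np.zeros(d)                                # running decayed sum of weights
    decay = np.exp(-w)                             # per-channel exponential decay of the past
    out = np.empty((T, d))
    for t in range(T):
        e_cur = np.exp(u + k[t])                   # extra weight for the current token
        out[t] = (a + e_cur * v[t]) / (b + e_cur)  # weighted average over past and current tokens
        a = decay * a + np.exp(k[t]) * v[t]        # fold the current token into the running state
        b = decay * b + np.exp(k[t])
    return out

# Toy usage: 6 tokens with 4 channels.
rng = np.random.default_rng(1)
T, d = 6, 4
out = wkv_recurrence(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                     np.abs(rng.normal(size=d)), rng.normal(size=d))
print(out.shape)                                   # (6, 4)
```

The state is just the two per-channel vectors a and b, updated elementwise, so compute and memory grow linearly with sequence length. This is the contrast with the quadratic attention matrix drawn on slide 41 and the sense in which RWKV "uses vectors rather than matrices."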