2024 | Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer
The availability of large unsupervised training data and neural scaling laws have led to a significant increase in model size and compute requirements for large language models (LLMs). However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS have scaled far faster than DRAM and interconnect bandwidth, making memory the primary bottleneck in AI applications, especially in serving. This paper analyzes encoder and decoder Transformer models and shows how memory bandwidth can become the dominant bottleneck for decoder models. It argues for a redesign of model architecture, training, and deployment strategies to overcome this memory limitation.
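To make the encoder-versus-decoder contrast concrete, the sketch below estimates the arithmetic intensity (FLOPs per byte of memory traffic) of a single fp16 linear layer when a full sequence is processed at once, as in an encoder or prompt pass, versus one token at a time, as in autoregressive decoding. The layer size (4096×4096), the 312 TFLOP/s peak, and the 2 TB/s bandwidth are illustrative, roughly A100-class assumptions, not figures taken from the paper.

```python
# Back-of-the-envelope arithmetic intensity of one fp16 linear layer
# (d_in x d_out weights, 2 bytes per element), comparing a full-sequence
# pass (encoder / prompt processing) with single-token autoregressive
# decoding. Hardware numbers are illustrative, roughly A100-class.

def arithmetic_intensity(n_tokens, d_in=4096, d_out=4096, bytes_per_el=2):
    flops = 2 * n_tokens * d_in * d_out          # multiply-accumulates
    bytes_moved = bytes_per_el * (
        d_in * d_out                             # weight matrix
        + n_tokens * d_in                        # input activations
        + n_tokens * d_out                       # output activations
    )
    return flops / bytes_moved                   # FLOPs per byte

peak_flops = 312e12     # assumed peak throughput, FLOP/s (fp16)
mem_bw = 2.0e12         # assumed DRAM bandwidth, bytes/s
machine_balance = peak_flops / mem_bw            # ~156 FLOPs per byte

for n in (512, 1):       # 512-token prompt pass vs. one decoded token
    ai = arithmetic_intensity(n)
    verdict = "compute-bound" if ai >= machine_balance else "memory-bound"
    print(f"{n:>3} token(s): intensity ~ {ai:6.1f} FLOPs/byte -> {verdict}")
```

With hundreds of tokens the layer performs roughly 400 FLOPs per byte and stays compute-bound, while single-token decoding moves about one byte per FLOP and is limited by memory bandwidth, which is the decoder bottleneck the paper highlights.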
The computational cost of training recent SOTA Transformer models in NLP has been scaling at a rate of 750× every two years, and model parameter counts have been scaling at 410× every two years. In contrast, peak hardware FLOPS have been scaling at only 3.0× every two years, while DRAM and interconnect bandwidth have grown even more slowly, at roughly 1.6× and 1.4× every two years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. The paper discusses this memory wall problem, which encompasses limited memory capacity, bandwidth, and latency, and highlights the growing challenge of training and serving large models under these constraints.
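As a quick back-of-the-envelope check on these rates, the snippet below compares them per two-year period and compounds the hardware-side rates over 20 years (ten two-year periods); the per-period figures are the ones quoted above.

```python
# The scaling rates quoted above, compared and compounded per two-year period.
compute_demand  = 750.0   # training FLOPs of SOTA Transformer models, per 2 years
model_params    = 410.0   # parameter count, per 2 years
hw_flops        = 3.0     # peak server hardware FLOPS, per 2 years
dram_bw         = 1.6     # DRAM bandwidth, per 2 years
interconnect_bw = 1.4     # interconnect bandwidth, per 2 years

print(f"compute demand vs. peak FLOPS : {compute_demand / hw_flops:.0f}x wider per 2 years")
print(f"model size vs. peak FLOPS     : {model_params / hw_flops:.0f}x wider per 2 years")
print(f"peak FLOPS vs. DRAM bandwidth : {hw_flops / dram_bw:.2f}x wider per 2 years")

# Compounded over 20 years (10 two-year periods), the hardware-side gap alone:
print(f"peak FLOPS growth over 20 yrs : {hw_flops ** 10:,.0f}x")
print(f"DRAM bandwidth growth         : {dram_bw ** 10:,.0f}x")
print(f"interconnect bandwidth growth : {interconnect_bw ** 10:,.0f}x")
```

Per period, training demand grows roughly 250× faster than peak FLOPS, and peak FLOPS in turn pull away from DRAM and interconnect bandwidth; compounded over two decades, that hardware-side gap alone spans several hundredfold, which is the widening imbalance the paper calls the memory wall.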
The paper presents a case study of Transformer models, showing that the arithmetic intensity of the underlying operations is a key determinant of achievable performance. It also discusses promising directions for breaking the memory wall, including more efficient training algorithms, efficient deployment techniques, and a rethinking of AI accelerator design. The paper concludes that the memory wall is becoming an increasingly challenging issue for AI applications, and that new approaches are needed to address this bottleneck.
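One way to see why efficient deployment techniques (of which lower-precision inference is a common example) belong on that list is to bound per-token decode latency by memory traffic: when generation is bandwidth-bound, every new token has to stream the model weights from DRAM at least once. The sketch below uses a hypothetical 13B-parameter decoder-only model and an assumed 2 TB/s of memory bandwidth, and deliberately ignores KV-cache traffic and batching.

```python
# Per-token decode latency floor when generation is memory-bandwidth bound:
# each generated token must read all model weights from DRAM at least once,
# so latency >= weight_bytes / memory_bandwidth. All values are illustrative
# assumptions; KV-cache traffic and batching are ignored.

MEM_BW = 2.0e12          # assumed memory bandwidth, bytes/s
N_PARAMS = 13e9          # hypothetical decoder-only model size, parameters

for precision, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    weight_bytes = N_PARAMS * bytes_per_param
    latency_s = weight_bytes / MEM_BW
    print(f"{precision}: {weight_bytes / 1e9:4.1f} GB of weight traffic per token "
          f"-> >= {latency_s * 1e3:4.1f} ms/token (<= {1 / latency_s:3.0f} tok/s)")
```

Halving the bytes per parameter halves the weight traffic and therefore roughly doubles the achievable single-stream decode rate on the same bandwidth, which is why shrinking memory traffic, rather than adding FLOPS, is the lever these deployment techniques pull.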