2024 | Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer
The availability of large unsupervised training data and neural scaling laws have led to a significant increase in model size and compute requirements for large language models (LLMs). However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS have scaled far faster than DRAM and interconnect bandwidth, making memory the primary bottleneck in AI applications, especially in serving. This paper analyzes encoder and decoder Transformer models and shows how memory bandwidth can become the dominant bottleneck for decoder models. It argues for a redesign of model architecture, training, and deployment strategies to overcome this memory limitation.
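To make the encoder-versus-decoder contrast concrete, the sketch below estimates the arithmetic intensity (FLOPs per byte of memory traffic) of a single fp16 linear layer when a full sequence is processed at once, as in an encoder or prompt pass, versus one token at a time, as in autoregressive decoding. The layer size (4096×4096), the 312 TFLOP/s peak, and the 2 TB/s bandwidth are illustrative, roughly A100-class assumptions, not figures taken from the paper.

```python
# Back-of-the-envelope arithmetic intensity of one fp16 linear layer
# (d_in x d_out weights, 2 bytes per element), comparing a full-sequence
# pass (encoder / prompt processing) with single-token autoregressive
# decoding. Hardware numbers are illustrative, roughly A100-class.

def arithmetic_intensity(n_tokens, d_in=4096, d_out=4096, bytes_per_el=2):
    flops = 2 * n_tokens * d_in * d_out          # multiply-accumulates
    bytes_moved = bytes_per_el * (
        d_in * d_out                             # weight matrix
        + n_tokens * d_in                        # input activations
        + n_tokens * d_out                       # output activations
    )
    return flops / bytes_moved                   # FLOPs per byte

peak_flops = 312e12     # assumed peak throughput, FLOP/s (fp16)
mem_bw = 2.0e12         # assumed DRAM bandwidth, bytes/s
machine_balance = peak_flops / mem_bw            # ~156 FLOPs per byte

for n in (512, 1):       # 512-token prompt pass vs. one decoded token
    ai = arithmetic_intensity(n)
    verdict = "compute-bound" if ai >= machine_balance else "memory-bound"
    print(f"{n:>3} token(s): intensity ~ {ai:6.1f} FLOPs/byte -> {verdict}")
```

With hundreds of tokens the layer performs roughly 400 FLOPs per byte and stays compute-bound, while single-token decoding moves about one byte per FLOP and is limited by memory bandwidth, which is the decoder bottleneck the paper highlights.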
The computational cost of training recent SOTA Transformer models in NLP has been scaling at a rate of 750× every two years, and model parameter counts have been scaling at 410× every two years. In contrast, peak hardware FLOPS have been scaling at only 3.0× every two years, while DRAM and interconnect bandwidth have grown even more slowly, at roughly 1.6× and 1.4× every two years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. The paper discusses this memory wall problem, which encompasses limited memory capacity, bandwidth, and latency, and highlights the growing challenge of training and serving large models under these constraints.
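As a quick back-of-the-envelope check on these rates, the snippet below compares them per two-year period and compounds the hardware-side rates over 20 years (ten two-year periods); the per-period figures are the ones quoted above.

```python
# The scaling rates quoted above, compared and compounded per two-year period.
compute_demand  = 750.0   # training FLOPs of SOTA Transformer models, per 2 years
model_params    = 410.0   # parameter count, per 2 years
hw_flops        = 3.0     # peak server hardware FLOPS, per 2 years
dram_bw         = 1.6     # DRAM bandwidth, per 2 years
interconnect_bw = 1.4     # interconnect bandwidth, per 2 years

print(f"compute demand vs. peak FLOPS : {compute_demand / hw_flops:.0f}x wider per 2 years")
print(f"model size vs. peak FLOPS     : {model_params / hw_flops:.0f}x wider per 2 years")
print(f"peak FLOPS vs. DRAM bandwidth : {hw_flops / dram_bw:.2f}x wider per 2 years")

# Compounded over 20 years (10 two-year periods), the hardware-side gap alone:
print(f"peak FLOPS growth over 20 yrs : {hw_flops ** 10:,.0f}x")
print(f"DRAM bandwidth growth         : {dram_bw ** 10:,.0f}x")
print(f"interconnect bandwidth growth : {interconnect_bw ** 10:,.0f}x")
```

Per period, training demand grows roughly 250× faster than peak FLOPS, and peak FLOPS in turn pull away from DRAM and interconnect bandwidth; compounded over two decades, that hardware-side gap alone spans several hundredfold, which is the widening imbalance the paper calls the memory wall.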
The paper presents a case study of Transformer models, showing that the arithmetic intensity of the underlying operations is a key determinant of achievable performance. It also discusses promising directions for breaking the memory wall, including more efficient training algorithms, efficient deployment techniques, and a rethinking of AI accelerator design. The paper concludes that the memory wall is becoming an increasingly challenging issue for AI applications, and that new approaches are needed to address this bottleneck.
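One way to see why efficient deployment techniques (of which lower-precision inference is a common example) belong on that list is to bound per-token decode latency by memory traffic: when generation is bandwidth-bound, every new token has to stream the model weights from DRAM at least once. The sketch below uses a hypothetical 13B-parameter decoder-only model and an assumed 2 TB/s of memory bandwidth, and deliberately ignores KV-cache traffic and batching.

```python
# Per-token decode latency floor when generation is memory-bandwidth bound:
# each generated token must read all model weights from DRAM at least once,
# so latency >= weight_bytes / memory_bandwidth. All values are illustrative
# assumptions; KV-cache traffic and batching are ignored.

MEM_BW = 2.0e12          # assumed memory bandwidth, bytes/s
N_PARAMS = 13e9          # hypothetical decoder-only model size, parameters

for precision, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    weight_bytes = N_PARAMS * bytes_per_param
    latency_s = weight_bytes / MEM_BW
    print(f"{precision}: {weight_bytes / 1e9:4.1f} GB of weight traffic per token "
          f"-> >= {latency_s * 1e3:4.1f} ms/token (<= {1 / latency_s:3.0f} tok/s)")
```

Halving the bytes per parameter halves the weight traffic and therefore roughly doubles the achievable single-stream decode rate on the same bandwidth, which is why shrinking memory traffic, rather than adding FLOPS, is the lever these deployment techniques pull.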