LLM Inference Unveiled: Survey and Roofline Model Insights

1 May 2024 | Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
This paper presents a comprehensive survey of efficient Large Language Model (LLM) inference, focusing on practical deployment and introducing a Roofline-model-based framework for the systematic analysis of inference techniques. The survey identifies the bottlenecks that arise when deploying LLMs on hardware devices and characterizes their memory and compute requirements. It collates the latest advances in efficient LLM inference across three layers: model compression (e.g., quantization, pruning, and knowledge distillation), algorithmic improvements (e.g., speculative decoding), and system- and hardware-level optimizations (e.g., operator fusion).

A central contribution is the use of the Roofline model to analyze how each technique shifts the balance between memory access and computation, yielding actionable guidance for practical implementation. The authors also release LLM-Viewer, a tool that applies this Roofline analysis to expose the bottlenecks of deploying a given LLM on a specific hardware device, enabling performance and efficiency analysis across hardware platforms. The survey concludes that the Roofline model provides a valuable framework for understanding and optimizing LLM inference, and that LLM-Viewer is a useful resource for researchers and practitioners in the field.
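To make the Roofline analysis concrete, here is a minimal sketch (not code from the paper or from LLM-Viewer) that estimates whether a decode-stage matrix-vector product is memory-bound or compute-bound. The hardware numbers and the d_model = 4096 setting are illustrative assumptions, roughly corresponding to an A100-class GPU; the roofline formula itself, attainable = min(peak_flops, bandwidth × arithmetic_intensity), is the standard one the survey builds on.

```python
# Minimal Roofline-model sketch for LLM decoding (illustrative, not LLM-Viewer).
# All hardware constants are assumptions, roughly A100-class.

PEAK_FLOPS = 312e12  # assumed peak FP16 throughput, FLOP/s
PEAK_BW = 2.0e12     # assumed HBM bandwidth, bytes/s (2 TB/s)

def attainable_flops(arith_intensity: float) -> float:
    """Roofline: performance is capped by either compute or memory traffic."""
    return min(PEAK_FLOPS, PEAK_BW * arith_intensity)

def gemv_intensity(d_model: int, bytes_per_weight: float) -> float:
    """Arithmetic intensity of a decode-stage GEMV y = W x, W of shape (d, d).

    FLOPs ~= 2 * d^2          (one multiply-add per weight)
    Bytes ~= d^2 * bytes/wt   (weight loads dominate; activations are O(d))
    """
    flops = 2 * d_model * d_model
    bytes_moved = d_model * d_model * bytes_per_weight
    return flops / bytes_moved

# Compare weight precisions: quantization shrinks memory traffic,
# raising arithmetic intensity and thus attainable throughput.
for name, bpw in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    ai = gemv_intensity(4096, bpw)
    perf = attainable_flops(ai)
    bound = "memory-bound" if perf < PEAK_FLOPS else "compute-bound"
    print(f"{name}: {ai:.1f} FLOP/byte -> {perf / 1e12:.1f} TFLOP/s ({bound})")
```

Under these assumed numbers, FP16 weights give only about 1 FLOP/byte, far below the ~156 FLOP/byte needed to saturate the compute roof, which is why single-batch decoding is memory-bound and why weight quantization to INT8 or INT4 directly raises attainable throughput.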