LLM Inference Unveiled: Survey and Roofline Model Insights


1 May 2024 | Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
The paper "LLM Inference Unveiled: Survey and Rooftine Model Insights" by Zhihang Yuan et al. provides a comprehensive survey of efficient Large Language Model (LLM) inference, focusing on practical aspects and introducing a systematic framework based on the Rooftine model. The authors highlight the challenges and opportunities in deploying LLMs on hardware devices, emphasizing the need for innovative solutions to make LLM inference more accessible and sustainable. They categorize strategies for improving LLM inference efficiency into four main areas: model compression, fast decoding algorithm design, system-level optimization, and hardware-level optimization. The paper also introduces LLM-Viewer, a tool that uses the Rooftine model to analyze the bottlenecks in LLM deployments, providing insights into memory access and computation. The survey covers various techniques such as quantization, pruning, knowledge distillation, and low-rank factorization, with detailed analyses of their impact on LLM inference. The authors discuss the evolution of quantization techniques, including Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and Quantization for Parameter-Efficient Fine-Tuning (Q-PEFT), and explore pruning methods like unstructured and structured pruning. Additionally, they delve into knowledge distillation techniques, both white-box and black-box, and their applications in LLM compression. The paper aims to provide a clear understanding of the current state of research and practical implementation in efficient LLM deployment.The paper "LLM Inference Unveiled: Survey and Rooftine Model Insights" by Zhihang Yuan et al. provides a comprehensive survey of efficient Large Language Model (LLM) inference, focusing on practical aspects and introducing a systematic framework based on the Rooftine model. The authors highlight the challenges and opportunities in deploying LLMs on hardware devices, emphasizing the need for innovative solutions to make LLM inference more accessible and sustainable. They categorize strategies for improving LLM inference efficiency into four main areas: model compression, fast decoding algorithm design, system-level optimization, and hardware-level optimization. The paper also introduces LLM-Viewer, a tool that uses the Rooftine model to analyze the bottlenecks in LLM deployments, providing insights into memory access and computation. The survey covers various techniques such as quantization, pruning, knowledge distillation, and low-rank factorization, with detailed analyses of their impact on LLM inference. The authors discuss the evolution of quantization techniques, including Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and Quantization for Parameter-Efficient Fine-Tuning (Q-PEFT), and explore pruning methods like unstructured and structured pruning. Additionally, they delve into knowledge distillation techniques, both white-box and black-box, and their applications in LLM compression. The paper aims to provide a clear understanding of the current state of research and practical implementation in efficient LLM deployment.