Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward

24 Apr 2024 | Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta
This paper provides a comprehensive survey of methods for improving the efficiency of Large Language Models (LLMs) through model compression and system-level optimizations. Despite the impressive performance of LLMs, their substantial compute and memory requirements hinder widespread adoption, especially in resource-constrained environments. The authors evaluate compression techniques, including architecture pruning, quantization, and knowledge distillation, on the LLaMA-7B and LLaMA-2-7B models. They also discuss system-level optimizations such as paged attention, tensor parallelism, and pipeline parallelism. The empirical analysis highlights the effectiveness of these methods and identifies their current limitations. The paper concludes with a discussion of potential future directions, including training-free pruning, localized distillation, and the development of optimized inference engines. The authors release their code and benchmarks to facilitate reproducibility and further research.
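As a concrete illustration of one of the surveyed directions, quantization, the sketch below loads a LLaMA-2-7B checkpoint with 4-bit weight quantization via Hugging Face transformers and bitsandbytes. This is a minimal example, not the benchmarking setup used in the paper; the model identifier and quantization settings are assumptions chosen for illustration.

```python
# Minimal sketch: 4-bit weight-only quantization of LLaMA-2-7B at load time.
# Illustrative only -- not the evaluation pipeline from the survey.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; requires access approval

# NF4 4-bit quantization with bfloat16 compute, as implemented by bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard/offload across available devices
)

# Quick generation check on the compressed model.
inputs = tokenizer("Model compression reduces", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Loading weights in 4-bit roughly quarters the memory footprint relative to FP16, which is what makes a 7B-parameter model fit on a single consumer GPU, at the cost of some accuracy that the surveyed quantization methods aim to minimize.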