Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward


24 Apr 2024 | Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah and Deepak Gupta
This survey explores current challenges and future directions for improving the efficiency of large language models (LLMs). Despite their impressive performance, LLMs face high computational and memory requirements during inference. Recent advances in model compression and system-level optimization aim to make LLM inference more efficient. The survey evaluates a range of compression techniques on LLaMA(2)-7B, providing practical insights for efficient LLM deployment, and the accompanying empirical analysis highlights the effectiveness of these methods. The paper identifies current limitations, discusses potential future directions for improving LLM inference efficiency, and releases a codebase for reproducing the results.

The survey covers model compression techniques such as pruning, quantization, and knowledge distillation, as well as system-level optimizations. Pruning approaches include structured pruning, unstructured pruning, and fine-tuning-free methods. Quantization methods such as 4-bit and 8-bit quantization are effective at reducing model size and improving inference speed. Knowledge distillation transfers knowledge from larger models to smaller ones, enabling efficient deployment, while low-rank approximation and tensor decomposition also contribute to model compression. System-level optimizations include paged attention, tensor parallelism, pipeline parallelism, and CPU/GPU offloading, all of which improve the runtime efficiency of LLMs. The survey also discusses open challenges such as computational intensity, rank selection in low-rank approximation, and ethical considerations.
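To make the 4-bit quantization setting concrete, the following is a minimal sketch of loading LLaMA-2-7B with NF4 weight quantization via bitsandbytes through the Hugging Face transformers integration. The model identifier and configuration flags are illustrative assumptions, not code from the survey's released repository.

```python
# Minimal sketch: 4-bit (NF4) weight quantization of LLaMA-2-7B using
# bitsandbytes via Hugging Face transformers. Model ID and flags are
# illustrative assumptions, not the survey's codebase.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                     # place layers on available GPUs/CPU
)

inputs = tokenizer("Model compression makes LLM inference", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```

In the same spirit, 8-bit weight quantization can be enabled with `load_in_8bit=True`; the trade-off in both cases is memory savings and speed against a possible accuracy drop.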
The paper concludes that further research is needed to achieve efficient LLM inference, with a focus on model compression, system-level optimizations, and alternative languages for faster execution.
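To illustrate the rank-selection challenge raised above, here is a minimal, illustrative PyTorch sketch (not code from the survey's repository) of compressing a single weight matrix by truncated SVD; the layer size and rank are assumptions.

```python
# Minimal sketch of low-rank approximation of one linear layer via truncated SVD:
# W (out x in) is replaced by two smaller matrices of rank r. Layer sizes and the
# rank are illustrative; choosing r (rank selection) is the open problem noted above.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    W = layer.weight.data                      # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]               # absorb singular values, (out, r)
    V_r = Vh[:rank, :]                         # (r, in)

    down = nn.Linear(layer.in_features, rank, bias=False)
    up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    down.weight.data = V_r
    up.weight.data = U_r
    if layer.bias is not None:
        up.bias.data = layer.bias.data
    return nn.Sequential(down, up)

layer = nn.Linear(4096, 4096)                   # a LLaMA-sized projection, for illustration
compressed = factorize_linear(layer, rank=256)  # roughly 8x fewer parameters for this layer
x = torch.randn(1, 4096)
print((layer(x) - compressed(x)).abs().mean())  # approximation error grows as rank shrinks
```

A smaller rank gives a lighter, faster layer but a coarser approximation of the original weights, which is exactly the trade-off that makes rank selection difficult.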