October 21, 2024 | Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agrawal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
LayerSkip is an end-to-end solution for accelerating inference in large language models (LLMs). The approach trains with layer dropout and an early exit loss, which lets the model exit at earlier layers during inference without adding any auxiliary layers or modules. During training, layer dropout is applied with lower rates for earlier layers and higher rates for later layers, while the early exit loss ensures the model's language model (LM) head can unembed the outputs of different layers. This recipe improves the accuracy of early exits and enables a self-speculative decoding approach at inference time, in which the early layers generate draft tokens and the remaining layers verify and correct them. Because the draft and verification stages share compute and activations, the method also reduces memory usage.
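The depth-dependent dropout schedule can be sketched as follows. The exponential form and the `p_max` value below are illustrative assumptions, not the paper's exact hyperparameters; the key property is that the rate is zero at the first layer and grows monotonically toward a maximum at the last layer.

```python
import math

def layer_dropout_rates(num_layers: int, p_max: float = 0.2) -> list[float]:
    """Per-layer dropout rates that grow with depth.

    Illustrative exponential schedule (assumed form): layer 0 gets
    rate 0 and the last layer gets exactly p_max.
    """
    if num_layers == 1:
        return [0.0]
    return [
        p_max * (math.exp(l * math.log(2) / (num_layers - 1)) - 1)
        for l in range(num_layers)
    ]

# Rates for a hypothetical 8-layer model: start at 0, end at p_max.
rates = layer_dropout_rates(8, p_max=0.2)
```

During training, layer `l` would then be skipped for a given sample with probability `rates[l]`, so later layers are dropped more often than earlier ones.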
The solution was tested across various LLM sizes and training regimes, including pretraining from scratch, continual pretraining, and fine-tuning on specific data domains and tasks. Results showed speedups of up to 2.16× on summarization, 1.82× on coding, and 2.0× on semantic parsing. The self-speculative decoding approach improves on traditional speculative decoding by reusing the draft stage's KV cache and exit query cache during verification, yielding faster inference with minimal accuracy loss and no separate draft model.
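The draft-then-verify loop can be sketched with toy stand-ins for the model. Here `draft_fn` plays the role of the early layers and `verify_fn` the full model; both names and the oracle token sequence are hypothetical, and the KV-cache reuse that makes the real method fast is omitted for clarity.

```python
def self_speculative_decode(prompt, draft_fn, verify_fn, num_draft=4, max_new=12):
    """Greedy self-speculative decoding sketch.

    draft_fn(tokens) -> next token using only the early layers;
    verify_fn(tokens) -> next token using the full model.
    In LayerSkip both come from the same model, so verification can
    reuse the draft pass's caches; this toy version recomputes instead.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Draft stage: early layers propose a short continuation.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_fn(tokens + draft))
        # Verify stage: full model checks each drafted token in order.
        accepted = num_draft
        for i in range(num_draft):
            target = verify_fn(tokens + draft[:i])
            if target != draft[i]:
                # Correct the first mismatch, discard the rest.
                draft = draft[:i] + [target]
                accepted = i
                break
        tokens.extend(draft[:accepted + 1])
    return tokens[:len(prompt) + max_new]

# Toy usage: the "full model" follows an oracle sequence; the "early
# layers" err whenever the next token is a multiple of 3.
target_seq = list(range(1, 21))
def verify_fn(tokens):
    return target_seq[len(tokens)]
def draft_fn(tokens):
    t = target_seq[len(tokens)]
    return t if t % 3 else 0
out = self_speculative_decode([], draft_fn, verify_fn, num_draft=4, max_new=12)
```

Each round still emits at least one verified token even when the very first draft token is wrong, so decoding always makes progress and the output matches what the full model alone would produce.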
The paper also explores the benefits of early exit in LLMs, showing that earlier layers can often predict tokens with high accuracy, and that later layers are frequently unnecessary for correct predictions. This insight motivates the use of layer dropout during training to reduce reliance on later layers. The proposed solution combines layer dropout and early exit loss to produce a model that can exit early during inference, with a single LM head shared across all exit layers, which reduces memory usage during both training and inference.
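The shared-LM-head idea can be illustrated with a minimal sketch: one unembedding function serves every exit point, so no per-layer prediction heads are needed. The layer and head computations below are toy stand-ins, not real transformer math.

```python
def make_layer(scale):
    # Toy "transformer layer": just scales the hidden state.
    return lambda h: [scale * x for x in h]

def lm_head(h):
    # Shared unembedding stand-in: argmax over the "logits".
    return max(range(len(h)), key=h.__getitem__)

class EarlyExitModel:
    def __init__(self, layers, head):
        self.layers = layers
        self.head = head  # one head shared by every exit point

    def forward(self, h, exit_layer=None):
        """Run layers up to exit_layer (all layers if None), then
        unembed with the single shared head."""
        n = len(self.layers) if exit_layer is None else exit_layer
        for layer in self.layers[:n]:
            h = layer(h)
        return self.head(h)

model = EarlyExitModel(
    [make_layer(2.0), make_layer(0.5), make_layer(3.0)], lm_head
)
full_pred = model.forward([1.0, 2.0])                 # all 3 layers
early_pred = model.forward([1.0, 2.0], exit_layer=1)  # exit after layer 0
```

Because every exit point feeds the same head, adding earlier exits costs no extra parameters; the early exit loss during training is what makes those intermediate hidden states compatible with the shared head.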
The results demonstrate that LayerSkip improves inference speed while maintaining accuracy, making it suitable for deployment on commodity GPUs and edge devices. The method is particularly effective for tasks where early predictions are accurate, and the model can benefit from the reduced computational load of exiting early. The paper also highlights the importance of curriculum learning in training, where early exit loss is gradually introduced to improve model performance. Overall, LayerSkip offers a promising approach to accelerate LLM inference while maintaining accuracy, with potential applications in various NLP tasks.
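The gradual curriculum for the early exit loss might look like the following sketch, where the auxiliary loss starts at the final layer only and is progressively enabled at earlier layers as training advances. The exact schedule below is an assumption for illustration; the paper discusses curriculum variants in detail.

```python
def active_exit_layers(step: int, total_steps: int, num_layers: int) -> list[int]:
    """Layers whose early-exit loss is active at a given training step.

    Illustrative 'gradual' curriculum (assumed form): begin with only
    the final layer's loss, then enable earlier exits linearly with
    training progress until every layer contributes.
    """
    frac = min(step / total_steps, 1.0)
    first = num_layers - 1 - int(frac * (num_layers - 1))
    return list(range(first, num_layers))

# For a hypothetical 8-layer model trained for 1000 steps:
start = active_exit_layers(0, 1000, 8)      # only the last layer
midway = active_exit_layers(500, 1000, 8)   # deeper half of the model
end = active_exit_layers(1000, 1000, 8)     # every layer
```

In a training loop, the total loss at each step would sum the LM loss computed from each active layer's exit, so the model is eased into producing head-compatible hidden states at shallower depths.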