October 21, 2024 | Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agrawal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
LayerSkip is an end-to-end solution for accelerating inference in large language models (LLMs). The approach trains with layer dropout and an early exit loss, which lets the model exit at earlier layers during inference without adding any auxiliary layers or modules. During training, layer dropout is applied with lower rates for earlier layers and higher rates for later layers, while the early exit loss ensures the model's language model (LM) head can unembed the outputs of different layers. This recipe improves the accuracy of early exits and enables a self-speculative decoding approach at inference time, in which the early layers generate draft tokens and the remaining layers verify and correct them. Because the draft and verification stages share compute and activations, the method also reduces memory usage.
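The depth-dependent dropout schedule can be sketched as follows. The exponential form and the `p_max` value below are illustrative assumptions, not the paper's exact hyperparameters; the key property is that the rate is zero at the first layer and grows monotonically toward a maximum at the last layer.

```python
import math

def layer_dropout_rates(num_layers: int, p_max: float = 0.2) -> list[float]:
    """Per-layer dropout rates that grow with depth.

    Illustrative exponential schedule (assumed form): layer 0 gets
    rate 0 and the last layer gets exactly p_max.
    """
    if num_layers == 1:
        return [0.0]
    return [
        p_max * (math.exp(l * math.log(2) / (num_layers - 1)) - 1)
        for l in range(num_layers)
    ]

# Rates for a hypothetical 8-layer model: start at 0, end at p_max.
rates = layer_dropout_rates(8, p_max=0.2)
```

During training, layer `l` would then be skipped for a given sample with probability `rates[l]`, so later layers are dropped more often than earlier ones.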
The solution was tested across various LLM sizes and training regimes, including pretraining from scratch, continual pretraining, and fine-tuning on specific data domains and tasks. Results showed speedups of up to 2.16× on summarization, 1.82× on coding, and 2.0× on semantic parsing. The self-speculative decoding approach improves on traditional speculative decoding by reusing the draft stage's KV cache and exit query cache during verification, yielding faster inference with minimal accuracy loss and no separate draft model.
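The draft-then-verify loop can be sketched with toy stand-ins for the model. Here `draft_fn` plays the role of the early layers and `verify_fn` the full model; both names and the oracle token sequence are hypothetical, and the KV-cache reuse that makes the real method fast is omitted for clarity.

```python
def self_speculative_decode(prompt, draft_fn, verify_fn, num_draft=4, max_new=12):
    """Greedy self-speculative decoding sketch.

    draft_fn(tokens) -> next token using only the early layers;
    verify_fn(tokens) -> next token using the full model.
    In LayerSkip both come from the same model, so verification can
    reuse the draft pass's caches; this toy version recomputes instead.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Draft stage: early layers propose a short continuation.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_fn(tokens + draft))
        # Verify stage: full model checks each drafted token in order.
        accepted = num_draft
        for i in range(num_draft):
            target = verify_fn(tokens + draft[:i])
            if target != draft[i]:
                # Correct the first mismatch, discard the rest.
                draft = draft[:i] + [target]
                accepted = i
                break
        tokens.extend(draft[:accepted + 1])
    return tokens[:len(prompt) + max_new]

# Toy usage: the "full model" follows an oracle sequence; the "early
# layers" err whenever the next token is a multiple of 3.
target_seq = list(range(1, 21))
def verify_fn(tokens):
    return target_seq[len(tokens)]
def draft_fn(tokens):
    t = target_seq[len(tokens)]
    return t if t % 3 else 0
out = self_speculative_decode([], draft_fn, verify_fn, num_draft=4, max_new=12)
```

Each round still emits at least one verified token even when the very first draft token is wrong, so decoding always makes progress and the output matches what the full model alone would produce.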
The paper also explores the benefits of early exit in LLMs, showing that earlier layers can often predict tokens with high accuracy, and that later layers are frequently unnecessary for correct predictions. This insight motivates the use of layer dropout during training to reduce reliance on later layers. The proposed solution combines layer dropout and early exit loss to produce a model that can exit early during inference, with a single LM head shared across all exit layers, which reduces memory usage during both training and inference.
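The shared-LM-head idea can be illustrated with a minimal sketch: one unembedding function serves every exit point, so no per-layer prediction heads are needed. The layer and head computations below are toy stand-ins, not real transformer math.

```python
def make_layer(scale):
    # Toy "transformer layer": just scales the hidden state.
    return lambda h: [scale * x for x in h]

def lm_head(h):
    # Shared unembedding stand-in: argmax over the "logits".
    return max(range(len(h)), key=h.__getitem__)

class EarlyExitModel:
    def __init__(self, layers, head):
        self.layers = layers
        self.head = head  # one head shared by every exit point

    def forward(self, h, exit_layer=None):
        """Run layers up to exit_layer (all layers if None), then
        unembed with the single shared head."""
        n = len(self.layers) if exit_layer is None else exit_layer
        for layer in self.layers[:n]:
            h = layer(h)
        return self.head(h)

model = EarlyExitModel(
    [make_layer(2.0), make_layer(0.5), make_layer(3.0)], lm_head
)
full_pred = model.forward([1.0, 2.0])                 # all 3 layers
early_pred = model.forward([1.0, 2.0], exit_layer=1)  # exit after layer 0
```

Because every exit point feeds the same head, adding earlier exits costs no extra parameters; the early exit loss during training is what makes those intermediate hidden states compatible with the shared head.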
The results demonstrate that LayerSkip improves inference speed while maintaining accuracy, making it suitable for deployment on commodity GPUs and edge devices. The method is particularly effective for tasks where early predictions are accurate, and the model can benefit from the reduced computational load of exiting early. The paper also highlights the importance of curriculum learning in training, where early exit loss is gradually introduced to improve model performance. Overall, LayerSkip offers a promising approach to accelerate LLM inference while maintaining accuracy, with potential applications in various NLP tasks.
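The gradual curriculum for the early exit loss might look like the following sketch, where the auxiliary loss starts at the final layer only and is progressively enabled at earlier layers as training advances. The exact schedule below is an assumption for illustration; the paper discusses curriculum variants in detail.

```python
def active_exit_layers(step: int, total_steps: int, num_layers: int) -> list[int]:
    """Layers whose early-exit loss is active at a given training step.

    Illustrative 'gradual' curriculum (assumed form): begin with only
    the final layer's loss, then enable earlier exits linearly with
    training progress until every layer contributes.
    """
    frac = min(step / total_steps, 1.0)
    first = num_layers - 1 - int(frac * (num_layers - 1))
    return list(range(first, num_layers))

# For a hypothetical 8-layer model trained for 1000 steps:
start = active_exit_layers(0, 1000, 8)      # only the last layer
midway = active_exit_layers(500, 1000, 8)   # deeper half of the model
end = active_exit_layers(1000, 1000, 8)     # every layer
```

In a training loop, the total loss at each step would sum the LM loss computed from each active layer's exit, so the model is eased into producing head-compatible hidden states at shallower depths.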