The Unreasonable Ineffectiveness of the Deeper Layers

26 Mar 2024 | Andrey Gromov*, Kushal Tirumala*, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts
We empirically study a simple layer-pruning strategy for popular open-weight pretrained large language models (LLMs), finding minimal degradation on question-answering benchmarks even after removing up to half of the layers. To prune these models, we identify the optimal block of layers to remove by measuring representation similarity across layers, then perform a small amount of parameter-efficient finetuning (PEFT) with quantization and Low-Rank Adapters (QLoRA) to heal the damage. Practically, layer pruning reduces the computational resources needed for finetuning and improves inference memory and latency. Scientifically, the robustness of LLMs to the deletion of deep layers suggests either that pretraining does not fully utilize the parameters in those layers, or that the shallow layers play the critical role in storing knowledge.
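To make the block-selection step concrete, the following is a minimal sketch (not the authors' released code) of similarity-informed pruning: given per-layer hidden states collected on a small calibration set, it measures the angular distance between the representations entering and leaving each candidate block of n consecutive layers and returns the block whose removal should perturb the residual stream the least. The tensor shapes, function names, and the assumption that hidden states come from a model run with output_hidden_states=True are illustrative.

```python
import torch
import torch.nn.functional as F

def angular_distance(x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
    """Mean angular distance (in units of pi) between two sets of hidden states.
    x_a, x_b: (num_tokens, hidden_dim), e.g. states gathered over a calibration set."""
    cos = F.cosine_similarity(x_a, x_b, dim=-1).clamp(-1.0, 1.0)
    return torch.arccos(cos).mean() / torch.pi

def best_block_to_prune(hidden_states: list, n: int) -> int:
    """hidden_states[i] is the representation after decoder layer i
    (index 0 = embedding output), as returned with output_hidden_states=True.
    Returns l* minimizing d(x_l, x_{l+n}); pruning the n layers that sit between
    those two representations should do the least damage."""
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n])
        for l in range(len(hidden_states) - n)
    ]
    return int(torch.stack(distances).argmin())
```

In the paper, this selection is followed by a short QLoRA finetune to heal the pruned model; the sketch above covers only the selection step.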
Our main result is that a substantial fraction of the deepest layers can be removed with minimal performance degradation; for Llama-2-70B, for example, we can eliminate up to roughly half of the layers before performance collapses. Beyond shrinking the inference footprint, pruning also probes how networks actually use their parameters. Our intuition comes from the residual structure of the transformer: the output of the final layer is a sum of the outputs of all preceding layers plus the embedded input. If those terms were numerous and independent, removing a handful of them should barely change the output; they are not independent, but because representations change slowly from layer to layer, deleting a block whose input and output representations are similar perturbs the model only mildly. We find that deeper layers are more similar to their neighbors than shallow layers are, suggesting an even simpler strategy: remove a block of layers ending at the penultimate layer, extending toward shallower layers as more pruning is desired. After healing with QLoRA, this heuristic nearly matches the similarity-informed strategy, again suggesting that LLMs do not fully leverage their deeper layers. Experiments on several open-weight model families confirm that benchmark performance remains robust until a large fraction of layers is removed, and that deep layers can be pruned with little impact, consistent with shallow layers carrying much of the stored knowledge. Taken together, layer pruning both reduces the resources needed for finetuning and inference and offers insight into how knowledge is distributed across depth.
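As an illustration of that simpler heuristic (a sketch under stated assumptions, not the authors' implementation), the snippet below drops the n decoder blocks that end at the penultimate layer of a Llama-style Hugging Face checkpoint. The attribute path model.model.layers and the bookkeeping shown are assumptions about that model class; after pruning, the model still needs a short QLoRA-style healing finetune (e.g. with the peft library) before evaluation.

```python
import torch
from transformers import AutoModelForCausalLM

def drop_deepest_layers(model_name: str, n: int):
    """Remove the n decoder layers ending at the penultimate layer
    (the final layer and the unembedding are kept)."""
    model = AutoModelForCausalLM.from_pretrained(model_name)
    layers = model.model.layers                      # assumed Llama-style layout
    keep = list(layers[: len(layers) - n - 1]) + [layers[-1]]
    model.model.layers = torch.nn.ModuleList(keep)
    model.config.num_hidden_layers = len(keep)
    # Per-layer metadata (e.g. self_attn.layer_idx used for KV caching) may
    # need renumbering, depending on the transformers version.
    for idx, layer in enumerate(model.model.layers):
        if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = idx
    return model

# Example: prune 8 of Llama-2-7B's 32 layers, then heal with a short
# parameter-efficient finetune (QLoRA) before measuring benchmark accuracy.
# pruned = drop_deepest_layers("meta-llama/Llama-2-7b-hf", n=8)
```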