The Unreasonable Ineffectiveness of the Deeper Layers

26 Mar 2024 | Andrey Gromov*, Kushal Tirumala*, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts
This paper explores a layer-pruning strategy for large language models (LLMs) that reduces their computational and memory requirements. The authors find that pruning a significant fraction of the deeper layers of LLMs, up to roughly half, does not significantly degrade performance on question-answering benchmarks. They identify the optimal block of layers to prune by measuring the similarity between representations across layers, and then "heal" the pruning-induced mismatch with a small amount of fine-tuning using parameter-efficient methods, specifically quantization and Low Rank Adapters (QLoRA). The results suggest that either current pretraining methods are not leveraging the parameters in the deeper layers effectively, or that the shallow layers play a critical role in storing knowledge. Layer pruning also complements other efficiency techniques such as quantization and parameter-efficient fine-tuning, making it a practical tool for reducing the computational and memory footprint of LLMs. The paper closes by discussing the scientific implications of this robustness to layer pruning, namely that the shallow layers may be more important for knowledge storage than the deep ones.
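To make the layer-selection step concrete, below is a minimal sketch (not the authors' released code) of the similarity-driven pruning heuristic: collect the hidden state at the input of every transformer block on a few sample prompts, compute the angular distance between representations n layers apart, and pick the contiguous block of n layers where that distance is smallest. The checkpoint name, prompt list, and helper function names are illustrative assumptions, and the subsequent QLoRA healing step is not shown.

# Sketch of similarity-based block selection for layer pruning.
# Assumptions: a Hugging Face causal LM checkpoint, a handful of sample prompts,
# and the last-token hidden state as the per-layer representation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # example checkpoint, assumed for illustration

def collect_hidden_states(model, tokenizer, prompts, device="cpu"):
    """Return a tensor of shape (num_layers + 1, num_prompts, hidden_dim):
    hidden_states[0] is the embedding output (input to block 0) and
    hidden_states[l] is the output of block l - 1, taken at the last token."""
    reps = []
    model.eval()
    with torch.no_grad():
        for text in prompts:
            inputs = tokenizer(text, return_tensors="pt").to(device)
            out = model(**inputs, output_hidden_states=True)
            # out.hidden_states: tuple of (1, seq_len, hidden_dim), length num_layers + 1
            last_token = torch.stack([h[0, -1] for h in out.hidden_states])
            reps.append(last_token)
    return torch.stack(reps, dim=1)

def angular_distance(a, b, eps=1e-6):
    """d(a, b) = arccos(cos_sim(a, b)) / pi, averaged over the sample prompts."""
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1).clamp(-1 + eps, 1 - eps)
    return torch.arccos(cos).mean() / torch.pi

def best_block_to_prune(hidden, n):
    """Find the start index l minimizing d(x^l, x^{l+n}), i.e. the contiguous
    block of n layers whose removal perturbs the representation the least."""
    num_layers = hidden.shape[0] - 1
    dists = torch.stack([angular_distance(hidden[l], hidden[l + n])
                         for l in range(num_layers - n + 1)])
    start = int(dists.argmin())
    return start, dists[start].item()

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    prompts = ["The capital of France is", "Water boils at a temperature of"]
    n = 8  # number of layers to drop; sweep this to trade accuracy for size
    hidden = collect_hidden_states(model, tokenizer, prompts)
    start, dist = best_block_to_prune(hidden, n)
    print(f"Candidate block to prune: layers {start}..{start + n - 1} "
          f"(angular distance {dist:.3f})")

In this sketch, hidden_states[l] is the input to (0-indexed) block l, so a small distance between hidden_states[l] and hidden_states[l + n] indicates that blocks l through l + n - 1 change the representation little and are natural candidates for removal before a brief parameter-efficient healing run.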