8 Aug 2024 | Shwai He*, Guoheng Sun*, Zhenyu Shen, Ang Li†
The paper "What Matters in Transformers? Not All Attention is Needed" by Shwai He, Guoheng Sun, Zhenyu Shen, and Ang Li explores the redundancy in Transformer-based large language models (LLMs) and proposes methods to reduce this redundancy without compromising performance. The authors investigate the redundancy across different modules, including Blocks, MLP layers, and Attention layers. They find that while Attention layers are essential for transformers, a significant portion of them exhibit high similarity and can be pruned without affecting performance. This leads to reduced memory and computational costs.
The paper introduces a similarity-based metric to quantify this redundancy and proposes two pruning techniques: Block Drop, which removes entire Transformer blocks, and Layer Drop, which removes individual Attention or MLP layers. The authors further propose Joint Layer Drop, which drops Attention and MLP layers together, achieving higher dropping ratios with improved performance.
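The redundancy metric can be sketched roughly as follows. This is a minimal illustration, assuming the score compares a sublayer's input and output hidden states with cosine similarity on calibration data; the function name `attention_importance` is illustrative and not taken from the authors' code.

```python
# Minimal sketch of a similarity-based redundancy score (assumed form:
# 1 - cosine similarity between a sublayer's input and output hidden states).
import torch
import torch.nn.functional as F

def attention_importance(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> torch.Tensor:
    """Return 1 - cos(input, output), averaged over batch and tokens.

    A score near 0 means the sublayer barely transforms its input,
    making it a candidate for dropping.
    """
    # hidden_in / hidden_out: (batch, seq_len, hidden_dim)
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1)  # (batch, seq_len)
    return 1.0 - cos.mean()
```

In practice such scores would be collected with forward hooks over a small calibration set, and the lowest-scoring Attention (or MLP) layers would be the ones dropped.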
Experiments on models such as Llama-2-70B and Mistral-7B show that dropping up to 50% of Attention layers does not significantly impact performance. The proposed methods are also robust to the choice of calibration dataset and sample size. The authors conclude that these findings offer useful guidance for designing more efficient network architectures and suggest future directions such as optimizing training processes and exploring alternatives to Attention layers.
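As a rough illustration of what dropping Attention layers amounts to, the toy pre-norm decoder below skips the Attention residual branch in selected blocks so that those blocks reduce to their MLP path. The class and flag names (`ToyBlock`, `drop_attention`) are hypothetical and not the paper's implementation.

```python
# Toy sketch of Layer Drop on a pre-norm decoder stack: "dropping" an
# Attention layer here means skipping its residual branch entirely,
# leaving only x -> x + MLP(norm(x)) in that block.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, dim: int, drop_attention: bool = False):
        super().__init__()
        self.drop_attention = drop_attention
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp_norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.drop_attention:
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out  # Attention residual branch
        # With drop_attention=True the block is MLP-only.
        return x + self.mlp(self.mlp_norm(x))

# Drop the Attention branch in half of the blocks, e.g. those with the
# lowest importance scores measured on calibration data.
blocks = nn.ModuleList([ToyBlock(64, drop_attention=(i % 2 == 0)) for i in range(8)])
x = torch.randn(2, 16, 64)
for block in blocks:
    x = block(x)
print(x.shape)  # torch.Size([2, 16, 64])
```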