What Matters in Transformers? Not All Attention is Needed

8 Aug 2024 | Shwai He, Guoheng Sun, Zhenyu Shen, Ang Li
This paper investigates redundancy in the different modules of Transformer-based large language models (LLMs), including Blocks, MLP layers, and Attention layers. Although attention is essential to the Transformer architecture, the study finds that a large proportion of attention layers exhibit high similarity between their inputs and outputs and can be safely pruned without degrading performance, reducing memory and computation costs. The authors further propose Joint Layer Drop, a strategy that drops Attention and MLP layers together, achieving better performance at higher dropping ratios. Extensive experiments show that Llama-3-70B maintains comparable performance even after half of its attention layers are pruned. The code is available at https://github.com/Shwai-He/LLM-Drop.

The paper also evaluates module dropping on a range of models, including Llama-2-13B, Mistral-7B, and Llama-2-70B, consistently showing that attention layers are more redundant than MLP layers. The authors argue that this structured redundancy can be exploited to improve efficiency, suggest that future architectures could reduce the number of attention layers, particularly in deeper parts of the network, and note that better training procedures might unlock the full potential of the attention layers that remain. They conclude that the proposed methods effectively identify both important and redundant layers, offering valuable guidance for future network design.
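The core idea, scoring a sublayer as redundant when its output closely matches its input, can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' implementation (see the linked repository for that): the helper names `layer_importance` and `rank_redundant_layers`, the 1 − cosine-similarity score, and the toy data are all assumptions made for clarity, and in practice the hidden states would be captured with forward hooks on each attention sublayer during a calibration pass.

```python
import torch
import torch.nn.functional as F


def layer_importance(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Score a sublayer as 1 - cosine similarity between its input and output
    hidden states. A score near 0 means the sublayer barely transforms its
    input, making it a candidate for dropping."""
    cos = F.cosine_similarity(hidden_in.flatten(1), hidden_out.flatten(1), dim=-1)
    return 1.0 - cos.mean().item()


def rank_redundant_layers(inputs_per_layer, outputs_per_layer, drop_ratio=0.5):
    """Rank sublayers (e.g. all attention sublayers) by importance and return
    the indices of the least important ones, i.e. those to drop."""
    scores = [layer_importance(x, y) for x, y in zip(inputs_per_layer, outputs_per_layer)]
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending importance
    n_drop = int(len(scores) * drop_ratio)
    return order[:n_drop], scores


# Toy usage: random hidden states standing in for activations of a 12-layer model.
torch.manual_seed(0)
num_layers, batch, seq, dim = 12, 2, 16, 64
ins = [torch.randn(batch, seq, dim) for _ in range(num_layers)]
outs = [x + 0.01 * torch.randn_like(x) for x in ins]  # nearly-identity sublayers
to_drop, scores = rank_redundant_layers(ins, outs, drop_ratio=0.5)
print("attention sublayers to drop:", sorted(to_drop))
```

Under this scoring, dropping half of the attention sublayers simply means removing the lowest-scoring half; a joint variant along the lines of Joint Layer Drop would pool attention and MLP scores into a single ranking before pruning.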