Are Sixteen Heads Really Better than One?


4 Nov 2019 | Paul Michel, Omer Levy, Graham Neubig
The paper "Are Sixteen Heads Really Better than One?" by Paul Michel, Omer Levy, and Graham Neubig explores the effectiveness of multi-headed attention in neural models, particularly in natural language processing (NLP) tasks such as machine translation and BERT-based models. The authors observe that, surprisingly, a significant number of attention heads can be removed from trained models without significantly impacting performance. They find that many layers can be reduced to a single head, and propose a greedy algorithm for pruning attention heads. The pruning process improves inference speed and memory efficiency while maintaining or improving accuracy. The study also reveals that encoder-decoder attention layers in machine translation models are more sensitive to pruning compared to self-attention layers, suggesting that multi-headedness plays a critical role in these components. Additionally, the authors provide evidence that the importance of heads increases during training, indicating an interaction between multi-headedness and training dynamics. The findings suggest that models can be optimized for efficiency by carefully pruning unnecessary attention heads.The paper "Are Sixteen Heads Really Better than One?" by Paul Michel, Omer Levy, and Graham Neubig explores the effectiveness of multi-headed attention in neural models, particularly in natural language processing (NLP) tasks such as machine translation and BERT-based models. The authors observe that, surprisingly, a significant number of attention heads can be removed from trained models without significantly impacting performance. They find that many layers can be reduced to a single head, and propose a greedy algorithm for pruning attention heads. The pruning process improves inference speed and memory efficiency while maintaining or improving accuracy. The study also reveals that encoder-decoder attention layers in machine translation models are more sensitive to pruning compared to self-attention layers, suggesting that multi-headedness plays a critical role in these components. Additionally, the authors provide evidence that the importance of heads increases during training, indicating an interaction between multi-headedness and training dynamics. The findings suggest that models can be optimized for efficiency by carefully pruning unnecessary attention heads.