Are Sixteen Heads Really Better than One?

4 Nov 2019 | Paul Michel, Omer Levy, Graham Neubig
This paper investigates the effectiveness of multi-headed attention (MHA) in neural language models, focusing on transformer-based machine translation (WMT) models and BERT. Despite the theoretical appeal of attending to multiple positions in parallel, the authors find that a large fraction of attention heads can be removed at test time without noticeable degradation in performance, and in some cases entire layers can be reduced to a single head.

To go beyond ablating one head at a time, the study proposes a greedy pruning algorithm that estimates each head's importance from the gradient of the loss with respect to a per-head gate and iteratively removes the heads with the least impact on model performance. Because each head accounts for a non-negligible share of the model's parameters, pruning yields substantial efficiency gains: pruned BERT models achieve up to a 17.5% increase in inference speed, and the reduced memory footprint makes head pruning particularly attractive in memory-constrained settings.

Further analysis shows that encoder-decoder attention layers in machine translation models rely more heavily on multi-headedness than self-attention layers, and that the importance of individual heads becomes more pronounced as training progresses, with the most important heads determined early in training, suggesting an interaction between multi-headedness and training dynamics. Overall, the findings indicate that while MHA provides theoretical advantages, in practice most heads are redundant and models can be pruned substantially without loss of performance. This has important implications for model efficiency and deployment in real-world applications.
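To make the pruning procedure concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' code) of the gate-gradient importance proxy: each head's output is multiplied by a gate ξ_h = 1, the absolute gradient of the loss with respect to ξ_h is used as an importance score, and the least important heads are masked greedily. The module, data, and variable names are toy assumptions chosen for illustration.

```python
# Minimal sketch of gate-gradient head importance and greedy head masking.
# All model/data specifics here are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

class GatedMultiHeadAttention(nn.Module):
    """Multi-head self-attention with a per-head gate xi (initialized to ones).
    Setting xi[h] = 0 ablates head h; |dL/dxi[h]| serves as an importance proxy."""
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.xi = nn.Parameter(torch.ones(n_heads))  # per-head gate

    def forward(self, x):                             # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(t):                                 # -> (batch, heads, seq, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = att @ v                               # (batch, heads, seq, d_head)
        heads = heads * self.xi.view(1, -1, 1, 1)     # gate each head's output
        return self.out(heads.transpose(1, 2).reshape(B, T, -1))

# Toy "model": one attention layer followed by a mean-pooled classifier.
torch.manual_seed(0)
attn = GatedMultiHeadAttention()
clf = nn.Linear(64, 2)
x = torch.randn(32, 16, 64)                           # fake batch of inputs
y = torch.randint(0, 2, (32,))                        # fake labels

# Importance ~ |dL/dxi_h| at xi = 1. (For simplicity this differentiates the
# batch-mean loss; the paper averages per-example absolute gradients.)
loss = nn.functional.cross_entropy(clf(attn(x).mean(dim=1)), y)
loss.backward()
importance = attn.xi.grad.abs()

# Greedy pruning: zero out the gates of the k least important heads.
k = 4
pruned = importance.argsort()[:k]
with torch.no_grad():
    attn.xi[pruned] = 0.0
print("head importance:", importance.tolist())
print("pruned heads:", pruned.tolist())
```

In the paper this scoring and masking is applied iteratively across all attention layers, re-evaluating performance after each pruning step; the sketch above only shows a single scoring pass on one layer.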