24 Jan 2024 | Nathan Godey, Éric de la Clergerie, Benoît Sagot
Anisotropy is an inherent property of self-attention mechanisms in Transformers, observed across modalities including NLP, speech, and vision. The phenomenon, in which hidden representations cluster closely in angular distance (high cosine similarity), is not solely due to token-level distributions: it also appears in character-based models and in other modalities. The paper investigates the underlying causes of anisotropy and argues that it arises from inherent properties of the self-attention mechanism, in particular the sharpness of attention patterns. Empirical studies show that anisotropy is not limited to token-based language models but is also observed in character-based, vision, and speech models. The analysis reveals that the sharpness of self-attention increases during training, leading to higher anisotropy. The paper also shows that a drift in representations, induced by the softmax operation in the cross-entropy loss, contributes to anisotropy. The study suggests that anisotropy is a fundamental aspect of Transformer-based models, and that understanding its causes could guide the development of more isotropic models. The findings highlight the need for further research into the factors that induce anisotropy in self-attention mechanisms.
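For intuition, here is a minimal sketch (not the paper's exact protocol) of the two quantities the summary refers to: anisotropy measured as the mean pairwise cosine similarity of hidden representations, and attention sharpness measured as the mean entropy of attention rows (lower entropy means sharper attention). The model name, example sentences, and helper functions are illustrative assumptions, relying only on standard `torch` and Hugging Face `transformers` APIs.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # illustrative choice of pretrained Transformer, not the paper's setup
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = [
    "Anisotropy is an inherent property of self-attention.",
    "Hidden representations tend to cluster in a narrow cone.",
]

def mean_pairwise_cosine(hidden: torch.Tensor) -> float:
    """Average cosine similarity over all distinct pairs of token vectors."""
    normed = torch.nn.functional.normalize(hidden, dim=-1)
    sims = normed @ normed.T                       # (n_tokens, n_tokens)
    n = sims.shape[0]
    off_diag = sims.sum() - sims.diagonal().sum()  # drop self-similarities
    return (off_diag / (n * (n - 1))).item()

def mean_attention_entropy(attentions) -> float:
    """Average entropy of attention distributions; lower = sharper attention."""
    entropies = []
    for layer_attn in attentions:                  # (1, n_heads, seq, seq)
        probs = layer_attn.clamp_min(1e-12)        # avoid log(0) on masked positions
        ent = -(probs * probs.log()).sum(dim=-1)   # entropy per query token
        entropies.append(ent.mean())
    return torch.stack(entropies).mean().item()

token_states, all_attentions = [], []
with torch.no_grad():
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt")
        out = model(**inputs, output_attentions=True)
        token_states.append(out.last_hidden_state.squeeze(0))
        all_attentions.extend(out.attentions)

hidden = torch.cat(token_states, dim=0)
print(f"mean pairwise cosine similarity: {mean_pairwise_cosine(hidden):.3f}")
print(f"mean attention entropy:          {mean_attention_entropy(all_attentions):.3f}")
```

In this kind of analysis, the cosine-similarity statistic is typically averaged over representations drawn from many unrelated contexts and tracked across layers and training checkpoints; the paper's observation is that sharper attention (lower entropy) over training goes hand in hand with higher anisotropy.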