Anisotropy Is Inherent to Self-Attention in Transformers


24 Jan 2024 | Nathan Godey, Éric de la Clergerie, Benoît Sagot
The paper "Anisotropy Is Inherent to Self-Attention in Transformers" by Nathan Godey, Éric de la Clergerie, and Benoît Sagot explores the phenomenon of anisotropy in Transformers-based models, particularly in the context of natural language processing (NLP). Anisotropy refers to the property where hidden representations become unexpectedly similar in terms of angular distance (cosine-similarity), which is a common issue in self-supervised learning methods based on Transformers. The authors argue that anisotropy is not solely a consequence of optimizing the cross-entropy loss on long-tailed token distributions but is inherent to the self-attention mechanism. Key contributions of the paper include: 1. **Empirical Observations**: The authors demonstrate that anisotropy can be observed in language models with character-aware architectures, extending their findings to models trained on other modalities such as image and audio data. 2. **Analysis of Transformer Blocks**: They study untrained Transformer blocks to understand the relationship between anisotropy and the sharpness of the self-attention mechanism. 3. **Training Dynamics**: The paper investigates how query and key distributions drift during training, leading to sharper attention patterns and increased anisotropy. The authors conclude that anisotropy is a fundamental property of the self-attention mechanism in Transformers, influenced by the sharpness of attention patterns. They suggest that revising the self-attention operation could help reduce anisotropy without distorting the geometry of hidden representations. The paper also discusses the implications of anisotropy on downstream performance and the potential for developing isotropic architectures.The paper "Anisotropy Is Inherent to Self-Attention in Transformers" by Nathan Godey, Éric de la Clergerie, and Benoît Sagot explores the phenomenon of anisotropy in Transformers-based models, particularly in the context of natural language processing (NLP). Anisotropy refers to the property where hidden representations become unexpectedly similar in terms of angular distance (cosine-similarity), which is a common issue in self-supervised learning methods based on Transformers. The authors argue that anisotropy is not solely a consequence of optimizing the cross-entropy loss on long-tailed token distributions but is inherent to the self-attention mechanism. Key contributions of the paper include: 1. **Empirical Observations**: The authors demonstrate that anisotropy can be observed in language models with character-aware architectures, extending their findings to models trained on other modalities such as image and audio data. 2. **Analysis of Transformer Blocks**: They study untrained Transformer blocks to understand the relationship between anisotropy and the sharpness of the self-attention mechanism. 3. **Training Dynamics**: The paper investigates how query and key distributions drift during training, leading to sharper attention patterns and increased anisotropy. The authors conclude that anisotropy is a fundamental property of the self-attention mechanism in Transformers, influenced by the sharpness of attention patterns. They suggest that revising the self-attention operation could help reduce anisotropy without distorting the geometry of hidden representations. The paper also discusses the implications of anisotropy on downstream performance and the potential for developing isotropic architectures.
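The paper further links anisotropy to the sharpness of attention patterns that emerges as query and key distributions drift during training. The sketch below uses the mean entropy of softmax attention rows as a simple sharpness proxy and simulates the drift by adding a shared offset to queries and keys; both the entropy proxy and the drift simulation are illustrative assumptions, not the authors' exact experimental protocol.

```python
# Sketch of attention "sharpness": as query/key distributions drift away
# from the origin (simulated here with a shared offset), softmax attention
# rows concentrate on fewer positions, i.e. their entropy drops.
import numpy as np

def attention_entropy(queries: np.ndarray, keys: np.ndarray) -> float:
    """Mean Shannon entropy of softmax attention rows (lower = sharper)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(entropy.mean())

rng = np.random.default_rng(0)
n, d = 128, 64
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
print("entropy, centred q/k:", attention_entropy(q, k))

# Simulate the drift of query/key distributions during training.
offset = rng.normal(size=(1, d)) * 3.0
print("entropy, drifted q/k:", attention_entropy(q + offset, k + offset))
```

In this toy setting the drifted queries and keys yield noticeably lower attention entropy, mirroring the sharper attention patterns the paper associates with increasing anisotropy over the course of training.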