Transformers need glasses! Information over-squashing in language tasks

6 Jun 2024 | Federico Barbero*, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković
This paper investigates the limitations of decoder-only Transformers, which are the backbone of many large language models (LLMs). The authors analyze the propagation of information in these models, focusing on the representations of the last token in the final layer, which are used for next-token prediction. They discover a phenomenon called *representational collapse*, where distinct input sequences can yield representations that are arbitrarily close to each other, leading to errors in tasks such as counting or copying. This issue is exacerbated by the low-precision floating-point formats commonly used in modern LLMs. Additionally, the authors show that decoder-only Transformers can lose sensitivity to specific tokens, a phenomenon known as *over-squashing*, similar to what occurs in graph neural networks (GNNs). Empirical evidence supports these theoretical findings, and the paper proposes simple solutions to mitigate these issues. The work highlights the importance of understanding and addressing these limitations to improve the robustness and reliability of LLMs.
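The interaction between representational collapse and low-precision arithmetic can be illustrated with a minimal numerical sketch. This is not the authors' code or experimental setup: it uses mean pooling over scalar "token values" as a crude stand-in for the attention-based averaging analyzed in the paper, and the function name `pooled_repr` is purely illustrative. The point it makes is the same one the paper makes with counting and copying: a sequence of n ones and the same sequence followed by a zero produce pooled values that differ by only 1/(n+1), which eventually rounds to the identical bfloat16 number.

```python
import torch

def pooled_repr(tokens: list[float]) -> torch.Tensor:
    # Mean-pool token values in float32, then cast to bfloat16,
    # mimicking a low-precision final representation.
    x = torch.tensor(tokens, dtype=torch.float32)
    return x.mean().to(torch.bfloat16)

for n in (10, 100, 1000):
    ones = [1.0] * n           # sequence of n ones
    ones_zero = ones + [0.0]   # same sequence with one extra zero appended
    a, b = pooled_repr(ones), pooled_repr(ones_zero)
    status = "distinguishable" if a != b else "collapsed"
    print(f"n={n}: {a.item():.6f} vs {b.item():.6f} -> {status}")
```

Under these assumptions, the two sequences remain distinguishable at n=10 and n=100 but collapse to the same bfloat16 value around n=1000, since n/(n+1) falls within half a unit in the last place of 1.0. A model whose final-layer representation behaves like this average can no longer tell the two inputs apart, which is the mechanism behind the counting and copying failures the paper reports.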