6 Jun 2024 | Federico Barbero*, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković
Decoder-only Transformers used in large language models (LLMs) can fail to represent certain sequences faithfully, a failure this paper traces to two phenomena: representational collapse and over-squashing. The paper analyzes how information propagates in decoder-only Transformers, focusing on the representation of the last token in the final layer, since this is the representation used for next-token prediction. The analysis shows that certain distinct input sequences can produce arbitrarily close final-token representations, causing the model to fail at tasks such as copying and counting. The effect is exacerbated by the low-precision floating-point formats commonly used in LLMs.
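To make the collapse mechanism concrete, here is a minimal numeric sketch (an illustrative toy, not the paper's construction): a last-token "attention" that simply averages value vectors with uniform weights over a sequence of ones preceded by one distinguishing zero. The uniform weights, the specific sequence, and the helper name `last_token_output` are assumptions for illustration; the point is that the outputs for lengths n and n+1 differ by roughly 1/n², a gap that bfloat16 can no longer resolve once n is large.

```python
# Toy illustration of representational collapse under low precision (not the
# paper's formal construction): the last token attends uniformly over a
# sequence of ones preceded by a single zero. The exact outputs for lengths
# n and n+1 are 1 - 1/n and 1 - 1/(n+1); their gap shrinks like 1/n^2 and,
# for large n, both values round to the same bfloat16 number.
import torch

def last_token_output(n: int, dtype: torch.dtype) -> torch.Tensor:
    values = torch.ones(n, dtype=dtype)
    values[0] = 0.0                                   # single distinguishing token
    weights = torch.full((n,), 1.0 / n, dtype=dtype)  # uniform attention weights
    return (weights * values).sum()                   # attended last-token value

for n in (10, 100, 1_000, 10_000):
    gap32 = (last_token_output(n, torch.float32) - last_token_output(n + 1, torch.float32)).abs()
    gap16 = (last_token_output(n, torch.bfloat16) - last_token_output(n + 1, torch.bfloat16)).abs()
    print(f"n={n:>6}  float32 gap={gap32.item():.2e}  bfloat16 gap={gap16.item():.2e}")
```

In this toy setting the bfloat16 gap reaches exactly zero at much shorter lengths than the float32 gap, mirroring the paper's point that lower precision brings the collapse on sooner.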
The paper also identifies over-squashing: because of the causal attention pattern, information from tokens earlier in the sequence reaches the final token's representation through many more paths than information from later tokens, and the entire sequence must be compressed into that single fixed-size representation, so information is lost. The phenomenon is analogous to over-squashing in graph neural networks (GNNs) and is related to vanishing gradients. The theoretical analysis shows that representational collapse follows from the limited precision of floating-point numbers combined with the structure of the Transformer's attention mechanism.
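The path asymmetry behind over-squashing can be seen with a small counting sketch over the causal attention graph, under the assumption that each layer lets every position attend to itself and all earlier positions; the sequence length, layer count, and the helper name `paths_to_last_token` are illustrative choices, not values from the paper.

```python
# Count the paths in the causal attention graph from each input token to the
# last token's representation after L layers: at every layer, position k
# receives edges from all positions j <= k (self-attention with a causal mask).
def paths_to_last_token(seq_len: int, num_layers: int) -> list[int]:
    counts = []
    for src in range(seq_len):
        # layer-0 distribution: all mass on the source token
        layer = [1 if j == src else 0 for j in range(seq_len)]
        for _ in range(num_layers):
            # position k at the next layer is reached from every j <= k,
            # so the new counts are a running prefix sum
            prefix, nxt = 0, []
            for c in layer:
                prefix += c
                nxt.append(prefix)
            layer = nxt
        counts.append(layer[-1])  # paths ending at the final position
    return counts

print(paths_to_last_token(seq_len=8, num_layers=3))
# -> [36, 28, 21, 15, 10, 6, 3, 1]: the earliest token has 36 paths to the
#    final representation, while the last token has only 1.
```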
Empirical experiments on contemporary LLMs such as Gemini 1.5 and Gemma demonstrate these issues on copying and counting tasks. For example, Gemini 1.5 fails to copy the final token of a long sequence of ones that ends in a single zero, and its accuracy on counting tasks degrades as the sequence length increases. The results indicate that the model's ability to distinguish between sequences is limited by the precision of its representations.
The paper proposes mitigations, such as introducing additional tokens into the sequence to keep representations distinct. It also stresses the importance of understanding the limits of Transformers on tasks that require precise information processing, such as counting, and suggests that improvements in model architecture and training methods could help address these challenges. The findings contribute to a broader understanding of the limitations of Transformers on language tasks and provide insights for future research and practical applications.