7 May 2024 | Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Mengibar
TransformerFAM: Feedback Attention is Working Memory
**Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Mengibar**
Google LLC
Mountain View, CA, USA
dongseong@google.com
**Abstract**
While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
**Introduction**
The Transformer architecture has revolutionized deep learning, permeating diverse domains thanks to its efficacy and scalability. However, attention has quadratic complexity with respect to context length, which limits the ability to model long contexts. We hypothesize that attending to its own latent representations through a feedback loop can give rise to working memory in Transformers. Our proposed architecture, TransformerFAM, enables attention over both homogeneous sequence data and latent representations via a feedback loop, fostering the natural emergence of working memory within Transformers.
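To make the quadratic cost concrete, the snippet below (illustrative only, not taken from the paper) computes single-head self-attention in NumPy: the score matrix alone has shape (L, L), so compute and memory grow as O(L²) in the context length L.

```python
# Minimal single-head self-attention, to illustrate the quadratic cost:
# the score matrix is (L, L), so memory and compute scale as O(L^2).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv          # each (L, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (L, L)  <- quadratic in L
    return softmax(scores) @ v                # (L, d)

L, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(L, d))
wq, wk, wv = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))
out = self_attention(x, wq, wk, wv)           # shape (1024, 64)
```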
**TransformerFAM**
TransformerFAM introduces a feedback mechanism between intermediate layers, giving each layer a distributed working memory that matches its level of abstraction. The architectural changes are modest: Feedback Attention Memory (FAM) activations are appended to each block segment, included in block-wise self-attention, and the updated FAM is fed back for the next block, as sketched below. This enables richer representations and the dynamic propagation of global contextual information across blocks.
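The following sketch is a simplified, hypothetical rendering of this idea rather than the authors' implementation: the input is processed block by block, the FAM vectors are appended to the keys and values of each block's attention, and FAM queries then summarize the block, with the updated FAM carried forward. Names such as `fam_layer`, `attend`, and `fam_len` are our own.

```python
# A minimal sketch (not the authors' code) of feedback attention memory under
# simple assumptions: process the input block by block, let each block attend
# to [block, previous FAM], and let FAM queries compress the block so the
# updated FAM can be fed back to the next block.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv, wq, wk, wv):
    scores = (q @ wq) @ (kv @ wk).T / np.sqrt(wq.shape[-1])
    return softmax(scores) @ (kv @ wv)

def fam_layer(x, fam, block_len, wq, wk, wv):
    """One layer: process `x` block by block, carrying FAM across blocks."""
    outputs = []
    for start in range(0, x.shape[0], block_len):
        block = x[start:start + block_len]            # (B, d) input segment
        kv = np.concatenate([block, fam], axis=0)     # keys/values include past FAM
        outputs.append(attend(block, kv, wq, wk, wv)) # block output, local cost only
        fam = attend(fam, kv, wq, wk, wv)             # FAM queries compress the block
    return np.concatenate(outputs, axis=0), fam       # fam feeds the next call

d, fam_len, block_len, L = 64, 8, 128, 1024
rng = np.random.default_rng(0)
wq, wk, wv = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))
x, fam0 = rng.normal(size=(L, d)), np.zeros((fam_len, d))
y, fam1 = fam_layer(x, fam0, block_len, wq, wk, wv)   # y: (1024, 64), fam1: (8, 64)
```

Because each block attends only to its own tokens plus a fixed number of FAM vectors, per-block cost stays constant while the feedback loop lets global context persist indefinitely, and no new weight matrices are required.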
**Experiments**
We evaluate TransformerFAM on long-context tasks using Flan-PaLM models with 1B, 8B, and 24B parameters. TransformerFAM outperforms TransformerBSWA (Transformer with Block Sliding Window Attention) on all long-context tasks, demonstrating its effectiveness at compressing and retaining important contextual information over extremely long contexts.
**Conclusion**
TransformerFAM addresses the limitations of current Transformers by integrating attention-based working memory, a concept from neuroscience. Our goal is to ignite further research within the community to solve the ongoing challenge of limited memory in deep learning.