8 Jan 2021 | Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
BIGBIRD is a sparse attention mechanism that reduces the quadratic memory complexity of traditional transformers to linear, enabling the processing of much longer sequences. The model is theoretically as expressive as full attention mechanisms and is Turing complete, preserving key properties of the original model. BIGBIRD incorporates three components: global tokens that attend to the entire sequence, local neighboring tokens, and random tokens. This approach allows the model to handle sequences up to 8 times longer than previously possible with similar hardware.

BIGBIRD significantly improves performance on various NLP tasks, including question answering and summarization. It also introduces novel applications in genomics, where it enhances performance on tasks like promoter region prediction and chromatin profile prediction. The model's sparse attention mechanism is efficient and effective, achieving state-of-the-art results on multiple benchmarks.

Theoretical analysis shows that sparse attention mechanisms can approximate sequence functions and are Turing complete, though they may require more layers for certain tasks. Empirical results demonstrate that BIGBIRD outperforms other models in NLP tasks and provides improved performance in genomics applications.
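To make the three-component attention pattern concrete, below is a minimal sketch of how a combined global + local + random attention mask could be built. This is an illustrative toy, not the paper's actual block-sparse implementation; the function name `sparse_attention_mask` and the specific values of `window`, `num_global`, and `num_random` are assumptions chosen for readability.

```python
import numpy as np

def sparse_attention_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Build a boolean mask combining BIGBIRD's three attention components.

    mask[i, j] == True means query position i may attend to key position j.
    The window/num_global/num_random values are illustrative, not the
    paper's configuration.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local attention: each token attends to neighbors within a sliding window.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global attention: the first `num_global` tokens attend to everything,
    # and every token attends back to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each token attends to a few randomly chosen positions.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

if __name__ == "__main__":
    m = sparse_attention_mask(seq_len=16)
    # Each row has O(window + num_global + num_random) True entries instead of
    # O(seq_len), which is what replaces quadratic memory with roughly linear.
    print(m.sum(axis=1))
```

Because each query row touches only a constant number of keys, the per-layer memory cost grows linearly with sequence length rather than quadratically, which is the property that lets BIGBIRD scale to longer inputs.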