**Big Bird: Transformers for Longer Sequences**
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
Google Research
{manzilz, gurug, avinavadubey}@google.com
**Abstract**
Transformer-based models, such as BERT, have been highly successful in Natural Language Processing (NLP). However, their performance is limited by the quadratic dependency on sequence length due to their full attention mechanism. To address this, we propose BIGBIRD, a sparse attention mechanism that reduces this dependency to linear. We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, preserving the expressive power of full attention models. Our theoretical analysis highlights the benefits of using $O(1)$ global tokens, which attend to the entire sequence. The proposed sparse attention mechanism can handle sequences up to 8 times longer than what was previously possible with similar hardware. As a result, BIGBIRD significantly improves performance on various NLP tasks, such as question answering and summarization. We also demonstrate its application to genomics data, where it improves performance on downstream tasks like promoter-region and chromatin profile prediction.
**Introduction**
Transformers, such as BERT, have become the dominant architecture in NLP due to their versatility and robustness. Their self-attention mechanism allows each token to attend independently to every other token, enabling parallel computation and leveraging modern hardware accelerators. However, this mechanism has a quadratic complexity in the sequence length, limiting its applicability to tasks requiring larger context.
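To make the quadratic bottleneck concrete, the following is a minimal NumPy sketch of dense scaled dot-product attention. The function name and shapes are illustrative assumptions, not the paper's code; the point is that the full $n \times n$ score matrix makes memory and time grow quadratically with sequence length.

```python
# Minimal sketch (not the paper's implementation): dense self-attention
# materializes an (n, n) score matrix, the quadratic bottleneck noted above.
import numpy as np

def full_attention(Q, K, V):
    """Dense scaled dot-product attention over n tokens of dimension d."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                       # (n, n): O(n^2) memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # O(n^2 * d) time

rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(Q, K, V).shape)  # (512, 64)
```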
To overcome this limitation, we propose BIGBIRD, an attention mechanism whose complexity is linear in the sequence length. Inspired by graph sparsification methods, BIGBIRD combines three components: a small set of global tokens that attend to the entire sequence, sliding-window attention in which each token attends to its local neighbors, and random attention in which each token attends to a few randomly chosen tokens. This design allows BIGBIRD to handle much longer sequences while maintaining computational efficiency.
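As an illustration of these three components, the sketch below builds the corresponding boolean attention mask. It is a toy under assumed hyperparameters (`num_global`, `window`, `num_random`), not the actual BIGBIRD implementation, which additionally uses a blocked formulation for hardware efficiency.

```python
# Illustrative BIGBIRD-style attention mask: global + sliding-window + random.
import numpy as np

def bigbird_style_mask(n, num_global=2, window=3, num_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # 1) Global tokens: the first `num_global` tokens attend everywhere,
    #    and every token attends back to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 2) Local attention: each token attends to `window` neighbors per side.
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True

    # 3) Random attention: each token attends to a few random positions.
    for i in range(n):
        mask[i, rng.choice(n, size=num_random, replace=False)] = True

    return mask

mask = bigbird_style_mask(n=16)
# Each row permits only a constant number of keys, so the total number of
# attended query-key pairs grows linearly in n rather than quadratically.
print(mask.sum(axis=1))
```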
**Theoretical Results**
We prove that sparse attention mechanisms are as powerful as full-attention mechanisms. Specifically, we show that sparse attention mechanisms are universal approximators of sequence-to-sequence functions and are Turing complete. We also demonstrate that moving to sparse attention mechanisms incurs a cost, as certain tasks require more layers compared to full-attention mechanisms.
**Experiments**
We evaluate BIGBIRD on various NLP tasks, including masked language modeling, question answering, and document classification. BIGBIRD consistently outperforms existing models, achieving state-of-the-art results on several datasets. We also apply BIGBIRD to genomics data, improving performance on tasks like promoter region prediction and chromatin profile prediction.
**Conclusion**
BIGBIRD is a sparse attention mechanism that reduces the complexity of Transformers from quadratic to linear in sequence length, enabling them to handle much longer sequences on similar hardware and improving performance on a range of NLP and genomics tasks.