18 Feb 2020 | Nikita Kitaev*, Lukasz Kaiser*, Anselm Levskaya
The paper "Reformer: The Efficient Transformer" introduces two techniques to improve the efficiency of Transformers, particularly for long sequences. The first technique replaces dot-product attention with locality-sensitive hashing (LSH), reducing its complexity from O(L²) to O(L log L), where L is the sequence length. The second technique uses reversible residual layers, which allow storing activations only once during training, instead of N times as in standard ResNets. These improvements enable the Reformer model to perform on par with standard Transformers while being more memory-efficient and faster on long sequences. The authors demonstrate the effectiveness of these techniques through experiments on synthetic and real-world tasks, showing that Reformer can handle large models and long sequences efficiently.The paper "Reformer: The Efficient Transformer" introduces two techniques to improve the efficiency of Transformers, particularly for long sequences. The first technique replaces dot-product attention with locality-sensitive hashing (LSH), reducing its complexity from O(L²) to O(L log L), where L is the sequence length. The second technique uses reversible residual layers, which allow storing activations only once during training, instead of N times as in standard ResNets. These improvements enable the Reformer model to perform on par with standard Transformers while being more memory-efficient and faster on long sequences. The authors demonstrate the effectiveness of these techniques through experiments on synthetic and real-world tasks, showing that Reformer can handle large models and long sequences efficiently.