7 May 2024 | Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Günter Klambauer, Johannes Brandstetter, Michael Kopp, Sepp Hochreiter
The paper introduces xLSTM, an extended version of the Long Short-Term Memory (LSTM) architecture, designed to overcome the limitations of traditional LSTMs in language modeling. The key contributions of xLSTM include:
1. **Exponential Gating**: Exponential activation functions in the input and forget gates let the model revise earlier storage decisions more flexibly than standard sigmoid gating; normalization and log-space stabilization keep the exponential activations from overflowing (a sketch of the stabilized gating follows this list).
2. **Modified Memory Structures**:
- **sLSTM**: Keeps a scalar memory cell and scalar update, adds exponential gating, and introduces a new memory mixing technique via recurrent connections on the hidden state (see the first step sketch after this list).
- **mLSTM**: Enhances storage capacity by replacing the scalar cell with a matrix memory updated by a covariance (outer-product) rule; since it drops the hidden-to-hidden recurrence, it is fully parallelizable (see the second step sketch after this list).
3. **xLSTM Architecture**: xLSTM blocks are built by embedding these modified LSTM variants into residual blocks, which are then stacked to form xLSTM architectures (see the stacking sketch below). This design combines the benefits of exponential gating with parallelizable memory structures.
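The snippets below are minimal PyTorch sketches meant only to illustrate the ideas summarized above; the function names, argument layouts, and the single-head, unbatched formulation are my own simplifications, not the authors' reference implementation.

First, the stabilized exponential gating: the gates are handled in log space and a running stabilizer state `m` is subtracted before exponentiation, so the exponential input gate cannot overflow.

```python
import torch
import torch.nn.functional as F

def stabilized_exp_gates(i_pre, f_pre, m_prev):
    """Exponential input gate and sigmoid-parameterized forget gate,
    stabilized in log space with a running stabilizer state m (sketch)."""
    log_i = i_pre                      # log of the exponential input gate exp(i_pre)
    log_f = F.logsigmoid(f_pre)        # forget gate (an exponential forget gate is also possible)
    m = torch.maximum(log_f + m_prev, log_i)   # new stabilizer state
    i = torch.exp(log_i - m)           # stabilized input gate
    f = torch.exp(log_f + m_prev - m)  # stabilized forget gate
    return i, f, m
```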
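Next, a single sLSTM step under the same assumptions: a scalar (per-cell) memory `c` with a normalizer `n`, exponential gating, and memory mixing through recurrent weights `R` acting on the previous hidden state.

```python
import torch
import torch.nn.functional as F

def slstm_step(x_t, h_prev, c_prev, n_prev, m_prev, W, R, b):
    """One sLSTM step (sketch). W maps the input, R mixes memory via the
    previous hidden state, b is the bias; all are stacked for (z, i, f, o)."""
    z_pre, i_pre, f_pre, o_pre = (W @ x_t + R @ h_prev + b).chunk(4)
    z = torch.tanh(z_pre)              # cell input
    o = torch.sigmoid(o_pre)          # output gate
    log_f = F.logsigmoid(f_pre)
    m = torch.maximum(log_f + m_prev, i_pre)   # stabilizer state
    i = torch.exp(i_pre - m)          # stabilized exponential input gate
    f = torch.exp(log_f + m_prev - m) # stabilized forget gate
    c = f * c_prev + i * z            # scalar cell update
    n = f * n_prev + i                # normalizer update
    h = o * (c / n)                   # normalized hidden state
    return h, c, n, m
```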
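Then a single mLSTM step: the matrix memory `C` is updated with a covariance (outer-product) rule from key and value vectors and read out with a query. Because the output at step t does not depend on the previous hidden state, the recurrence can also be unrolled in parallel over the sequence.

```python
import torch

def mlstm_step(q_t, k_t, v_t, i_t, f_t, o_t, C_prev, n_prev):
    """One mLSTM step (sketch). q_t, k_t, v_t are query/key/value vectors;
    i_t and f_t are scalar gates (assumed already stabilized as above)."""
    d = k_t.shape[0]
    k_t = k_t / d ** 0.5                              # scale keys
    C = f_t * C_prev + i_t * torch.outer(v_t, k_t)    # covariance update of the matrix memory
    n = f_t * n_prev + i_t * k_t                      # normalizer vector
    denom = torch.clamp(torch.abs(n @ q_t), min=1.0)  # lower-bounded normalization
    h = o_t * (C @ q_t) / denom                       # gated read-out
    return h, C, n
```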
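Finally, a sketch of how such layers are residually stacked into an xLSTM architecture. Here `layer` stands in for an sLSTM- or mLSTM-based sub-layer (including its projections); only the normalization-plus-residual wiring is being illustrated.

```python
import torch
from torch import nn

class XLSTMBlock(nn.Module):
    """Residual block wrapping an sLSTM- or mLSTM-based layer (sketch)."""
    def __init__(self, dim, layer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.layer = layer                    # sLSTM- or mLSTM-based sub-layer

    def forward(self, x):
        return x + self.layer(self.norm(x))   # pre-norm residual connection

class XLSTMStack(nn.Module):
    """An xLSTM architecture: a stack of xLSTM blocks (sketch)."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```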
The paper evaluates xLSTM on formal language tasks, associative recall tasks, and long-sequence benchmarks, as well as on language modeling. The results show that xLSTM performs favorably compared to Transformers and State Space Models, suggesting it can compete with current Large Language Models (LLMs) in both performance and scalability. Scaling-law experiments indicate that larger xLSTM models remain competitive with Transformers and State Space Models as model size grows. The paper concludes by discussing xLSTM's limitations and its potential impact on other deep learning fields.