4 Mar 2024 | Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, Terry Lyons
This paper provides a theoretical foundation for recent empirical findings that structured state-space models (SSMs) can outperform attention-based transformers in various domains, particularly on long-range reasoning tasks. The authors use tools from Rough Path Theory to show that when the linear recurrences in SSMs are equipped with input-controlled transitions, the hidden state captures non-linear interactions between tokens at different time scales. This mechanism is the essence of modern selective SSMs such as Mamba, which achieve state-of-the-art performance in language modeling at significantly lower computational cost than transformers. The paper also proves that sufficiently wide linear recurrences are fully expressive, and that diagonal input-controlled recurrences collect input statistics more efficiently than non-diagonal ones. Additionally, it demonstrates how chaining these blocks allows the computation of higher-order global statistics, matching the expressive power of dense linear recurrences. The theoretical framework not only explains the success of selective SSMs but also provides a solid basis for understanding and designing future SSM variants.
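To make the selectivity mechanism concrete, here is a minimal NumPy sketch (not the authors' implementation; the sigmoid gating and the matrices `W_a`, `W_b` are illustrative assumptions) of a diagonal input-controlled linear recurrence of the kind the paper analyzes, where the transition at each step depends on the current token:

```python
import numpy as np

def selective_recurrence(xs, W_a, W_b):
    """Diagonal input-controlled linear recurrence (illustrative sketch).

    h_t = a(x_t) * h_{t-1} + W_b @ x_t, with '*' elementwise:
    because the transition a(x_t) depends on the current input,
    the hidden state accumulates non-linear interactions between
    tokens -- the 'selectivity' mechanism the paper studies.
    """
    d = W_a.shape[0]                          # hidden-state dimension
    h = np.zeros(d)
    states = []
    for x in xs:                              # xs: sequence of token vectors
        a = 1.0 / (1.0 + np.exp(-(W_a @ x)))  # input-dependent gate in (0, 1)
        h = a * h + W_b @ x                   # diagonal transition + input injection
        states.append(h.copy())
    return np.stack(states)                   # hidden state at every time step

# Example: 10 tokens of dimension 8, hidden state of dimension 16.
xs = np.random.randn(10, 8)
W_a = np.random.randn(16, 8)
W_b = np.random.randn(16, 8)
H = selective_recurrence(xs, W_a, W_b)        # shape (10, 16)
```

Chaining such blocks, with one layer's hidden states fed as the next layer's inputs, corresponds to the stacking argument by which the paper recovers higher-order global statistics from diagonal recurrences.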