Theoretical Foundations of Deep Selective State-Space Models

4 Mar 2024 | Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, Terry Lyons
This paper presents the theoretical foundations of deep selective state-space models (SSMs), focusing on their ability to model sequential data efficiently. The authors show that when the linear recurrences in SSMs are equipped with input-controlled transitions (selectivity mechanisms), the hidden state becomes a low-dimensional projection of the input's signature, capturing nonlinear interactions across different timescales. This theoretical insight helps explain the success of modern selective SSMs like Mamba, which outperform attention-based transformers in accuracy and efficiency, especially on long sequences.

The paper introduces a framework for analyzing input-controlled linear recurrences, covering S4, Mamba, and GLA, using tools from Rough Path Theory. It demonstrates that wide, randomly initialized, dense input-controlled linear recurrences are fully expressive, allowing the hidden state to approximate any continuous function from the input sequence to a target value. This contrasts with S4, where the hidden state is simply a convolution of the input sequence with a fixed kernel. The authors also show that diagonal input-controlled linear recurrences, such as Mamba, collect input statistics more efficiently than S4; by chaining such blocks with linear pointwise maps, higher-order global statistics can be computed, matching the expressive power of the dense input-controlled setting.

The paper further explores the expressivity of Linear Controlled Differential Equations (CDEs), showing that they can approximate a wide range of functions. It proves that randomly initialized matrices in a Linear CDE suffice for universal approximation, with only the final readout needing training, in line with the reservoir computing paradigm.

Comparing S4 and Mamba, the analysis shows that Mamba's input-dependent transitions allow it to capture higher-order statistics and achieve greater reasoning power than S4. The paper concludes that while diagonal structure in SSMs reduces expressivity, chaining such blocks restores their expressive capabilities. Empirical validation on toy datasets supports the theoretical findings, showing that models like Mamba and S5 outperform linear CDEs in capturing high-order statistics. The results highlight the importance of input-controlled transitions for expressive power in SSMs and demonstrate that these models can approximate general path-to-path functions without offloading all complexity to neural networks.
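As a rough illustration of the distinction between fixed and input-controlled transitions, here is a minimal NumPy sketch (not the authors' implementation, and not code from the paper): the first recurrence uses a fixed transition matrix, as in S4, so the final state is a linear convolution of the inputs with a fixed kernel; the second uses an input-dependent diagonal transition in the spirit of Mamba/GLA, so products of inputs from different time steps enter the state. The sigmoid gating, matrix shapes, and initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper).
d_in, d_hidden, seq_len = 4, 16, 32
x = rng.standard_normal((seq_len, d_in))      # input sequence x_1, ..., x_T

B = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)

# --- S4-style recurrence: fixed, input-independent transition ---
# h_t = A h_{t-1} + B x_t, so h_T is a convolution of the inputs with a fixed kernel.
A = 0.9 * np.eye(d_hidden)                    # placeholder for S4's structured transition
h = np.zeros(d_hidden)
for x_t in x:
    h = A @ h + B @ x_t                       # the state is linear in the input sequence

# --- Selective recurrence in the spirit of Mamba/GLA: input-controlled transition ---
# The diagonal transition depends on x_t, so the state accumulates products of inputs
# across time steps (the signature-like higher-order terms discussed above).
W_gate = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)

def gate(x_t):
    # Sigmoid keeps the diagonal transition in (0, 1); an illustrative choice, not the paper's.
    return 1.0 / (1.0 + np.exp(-W_gate @ x_t))

h_sel = np.zeros(d_hidden)
for x_t in x:
    h_sel = gate(x_t) * h_sel + B @ x_t       # elementwise (diagonal) input-dependent transition

print(h[:4])
print(h_sel[:4])
```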
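The reservoir-computing claim about Linear CDEs can also be sketched in a few lines: freeze randomly initialized matrices A_1, ..., A_d, run the discretized linear CDE Y_{t+1} = Y_t + sum_i A_i Y_t dX^i_t over toy input paths, and train only a least-squares readout on the final state to predict a second-order statistic of the path. Everything below (the path distribution, the state width, and the particular iterated-integral target) is an illustrative assumption, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes; the paper's universality result concerns the wide-width regime.
d_path, d_state, seq_len, n_train, n_test = 2, 256, 20, 400, 100

def random_paths(n):
    # Toy input paths: scaled random walks (an assumption for this sketch).
    return np.cumsum(0.1 * rng.standard_normal((n, seq_len, d_path)), axis=1)

# Frozen random matrices A_1, ..., A_{d_path} of the linear CDE dY = sum_i A_i Y dX^i.
A = rng.standard_normal((d_path, d_state, d_state)) / np.sqrt(d_state)

def final_state(path):
    # Euler discretization: Y_{t+1} = Y_t + sum_i A_i Y_t dX^i_t, with a fixed start state.
    y = np.ones(d_state) / np.sqrt(d_state)
    for dx in np.diff(path, axis=0):
        y = y + np.einsum("i,ijk,k->j", dx, A, y)
    return y

def target(path):
    # A second-order statistic of the path: the iterated integral of X^1 against dX^2.
    dX = np.diff(path, axis=0)
    running = np.cumsum(dX[:, 0])
    return float(np.sum(running[:-1] * dX[1:, 1]))

# Only the linear readout is trained; the CDE matrices stay at their random initialization.
train = random_paths(n_train)
features = np.stack([final_state(p) for p in train])
labels = np.array([target(p) for p in train])
w, *_ = np.linalg.lstsq(features, labels, rcond=None)

test = random_paths(n_test)
pred = np.stack([final_state(p) for p in test]) @ w
true = np.array([target(p) for p in test])
print("relative test error:", np.linalg.norm(pred - true) / np.linalg.norm(true))
```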