MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

23 Jul 2024 | Ali Behrouz, Michele Santacatterina, Ramin Zabih
MambaMixer is an efficient selective state space model with dual token and channel selection. The model uses a dual selection mechanism across tokens and channels, called the Selective Token and Channel Mixer, to efficiently select and mix informative tokens and channels while filtering out irrelevant ones. MambaMixer further connects sequential selective mixers through a weighted averaging mechanism, giving layers direct access to earlier layers' inputs and outputs. As a proof of concept, the authors design the Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and evaluate them on a range of vision and time series forecasting tasks. ViM2 achieves competitive performance with well-established vision models such as ViT, MLP-Mixer, and ConvMixer, and outperforms SSM-based vision models such as ViM and VMamba. In time series forecasting, TSM2, an attention- and MLP-free architecture, achieves outstanding performance compared to state-of-the-art methods at significantly lower computational cost. These results show that while Transformers, cross-channel attention, and cross-channel MLPs are sufficient for good performance in practice, none of them is necessary.
The MambaMixer block applies selective state space models (S6 blocks) along both the token and channel directions, enabling the model to selectively fuse information across both dimensions. To enhance information flow and capture the complex dynamics of features, MambaMixer uses a learnable weighted averaging mechanism over early features, allowing each block to access them directly. The authors build the ViM2 and TSM2 models on top of the MambaMixer block for vision and time series forecasting tasks. Their experimental evaluation shows that ViM2 achieves competitive performance with well-established vision models on ImageNet classification, object detection, and semantic segmentation, and outperforms SSM-based vision models. In time series forecasting, TSM2 outperforms all baselines on most datasets and achieves state-of-the-art performance at significantly lower computational cost.
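To make the block structure above concrete, here is a minimal PyTorch-style sketch of the described design: an S6-style selective mixer applied along the token axis, the same kind of mixer applied along the channel axis via a transpose, and learnable softmax weights averaging the current features with earlier blocks' outputs. The `SelectiveSSM` class is a simplified stand-in (a gated cumulative sum) rather than Mamba's hardware-aware selective scan, and all names (`MambaMixerBlock`, `n_prev`, `alpha`, `beta`) are illustrative assumptions, not taken from the authors' code.

```python
import torch
import torch.nn as nn


class SelectiveSSM(nn.Module):
    """Stand-in for an S6 (selective scan) layer.

    Implemented here as an input-gated cumulative sum purely for
    illustration; the real MambaMixer uses Mamba's selective scan.
    """
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, length, d_model)
        # Input-dependent gating followed by a causal (cumulative) scan.
        h = torch.cumsum(torch.sigmoid(self.gate(x)) * self.in_proj(x), dim=1)
        return self.out_proj(h)


class MambaMixerBlock(nn.Module):
    """One MambaMixer-style block (sketch):
    selective token mixing along the sequence axis, selective channel
    mixing along the feature axis (via transpose), and a learnable
    weighted average over earlier blocks' features.
    """
    def __init__(self, seq_len, d_model, n_prev):
        super().__init__()
        self.token_mixer = SelectiveSSM(d_model)    # scans across tokens
        self.channel_mixer = SelectiveSSM(seq_len)  # scans across channels
        # One scalar weight per earlier feature map plus the current one.
        self.alpha = nn.Parameter(torch.ones(n_prev + 1))
        self.beta = nn.Parameter(torch.ones(n_prev + 1))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, earlier):  # earlier: list of (B, L, D) tensors
        # Weighted average of the current input and earlier features.
        w = torch.softmax(self.alpha, dim=0)
        x = sum(wi * f for wi, f in zip(w, earlier + [x]))
        x = x + self.token_mixer(self.norm1(x))  # token mixing

        v = torch.softmax(self.beta, dim=0)
        x = sum(vi * f for vi, f in zip(v, earlier + [x]))
        # Channel mixing: transpose so the SSM scans over channels.
        x = x + self.channel_mixer(self.norm2(x).transpose(1, 2)).transpose(1, 2)
        return x


# Usage sketch (hypothetical shapes: 196 patch tokens, 192 channels):
block = MambaMixerBlock(seq_len=196, d_model=192, n_prev=2)
x = torch.randn(4, 196, 192)
earlier = [torch.randn(4, 196, 192), torch.randn(4, 196, 192)]
y = block(x, earlier)  # -> (4, 196, 192)
```

In this sketch the weighted averaging is what lets each block see features from earlier blocks directly, mirroring the paper's description of enhanced information flow between sequential selective mixers.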