The paper "Mamba in Speech: Towards an Alternative to Self-Attention" explores the application of Mamba, a Selective State Space Model, in speech processing tasks, particularly speech recognition and speech enhancement. Mamba is proposed as an alternative to the multi-head self-attention (MHSA) mechanism in Transformers, which is known for its computational complexity. The authors introduce two bidirectional Mamba (BiMamba) designs—external (ExtBiMamba) and internal (InnBiMamba)—to enhance the model's ability to capture global dependencies in speech signals. Extensive experiments on various datasets demonstrate that BiMamba outperforms vanilla Mamba and achieves superior results compared to Transformer and Conformer models in speech enhancement tasks. Additionally, BiMamba is shown to be effective in speech recognition tasks, where it replaces MHSA modules in Transformer and Conformer models, improving performance and reducing computational complexity. The study also highlights the importance of nonlinearity in capturing high-level semantic information, which is crucial for tasks like speech recognition. Ablation studies further validate the effectiveness of BiMamba in speech processing, confirming its potential as a viable alternative to self-attention mechanisms.The paper "Mamba in Speech: Towards an Alternative to Self-Attention" explores the application of Mamba, a Selective State Space Model, in speech processing tasks, particularly speech recognition and speech enhancement. Mamba is proposed as an alternative to the multi-head self-attention (MHSA) mechanism in Transformers, which is known for its computational complexity. The authors introduce two bidirectional Mamba (BiMamba) designs—external (ExtBiMamba) and internal (InnBiMamba)—to enhance the model's ability to capture global dependencies in speech signals. Extensive experiments on various datasets demonstrate that BiMamba outperforms vanilla Mamba and achieves superior results compared to Transformer and Conformer models in speech enhancement tasks. Additionally, BiMamba is shown to be effective in speech recognition tasks, where it replaces MHSA modules in Transformer and Conformer models, improving performance and reducing computational complexity. The study also highlights the importance of nonlinearity in capturing high-level semantic information, which is crucial for tasks like speech recognition. Ablation studies further validate the effectiveness of BiMamba in speech processing, confirming its potential as a viable alternative to self-attention mechanisms.