This paper explores the application of Mamba, a selective state space model, in speech processing tasks, particularly speech recognition and speech enhancement. Mamba, originally proposed as an alternative to self-attention in Transformer models, has shown effectiveness in natural language processing and computer vision but has not been thoroughly evaluated in speech processing. The study investigates the performance of bidirectional Mamba (BiMamba) in speech tasks, comparing it with vanilla Mamba and self-attention mechanisms.
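For context, the core of Mamba is a discretized linear state space recurrence whose parameters are made input-dependent ("selective"). A minimal statement of the standard formulation from the original Mamba paper, written here only as background:

```latex
% Discretized state space recurrence (zero-order hold), per the original Mamba paper
% h_t: hidden state, x_t: input, y_t: output
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
% with
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\, \Delta B
% Selectivity: B, C and the step size \Delta are functions of the input x_t,
% allowing the model to gate what is written to and read from the state.
```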
The research introduces two bidirectional Mamba variants: inner bidirectional Mamba (InnBiMamba) and external bidirectional Mamba (ExtBiMamba). InnBiMamba shares the input and output projection layers between the forward and backward paths within a single Mamba block, while ExtBiMamba uses a separate Mamba layer, with its own projections, for each direction. Experiments show that BiMamba outperforms vanilla Mamba in speech processing tasks, particularly in capturing global dependencies and semantic information.
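The sketch below illustrates the ExtBiMamba idea in PyTorch. It assumes the public mamba_ssm package (which requires CUDA), and it fuses the two directions by residual summation with a LayerNorm; the paper may combine the directional outputs differently (e.g., concatenation followed by a projection), so treat this as an illustrative sketch rather than the authors' exact block.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumes the public mamba_ssm package is installed


class ExtBiMamba(nn.Module):
    """Sketch of an *external* bidirectional Mamba block: two independent Mamba
    layers (each with its own projections) process the sequence in the forward
    and time-reversed directions; their outputs are fused by summation here."""

    def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 4, expand: int = 2):
        super().__init__()
        self.fwd = Mamba(d_model=d_model, d_state=d_state, d_conv=d_conv, expand=expand)
        self.bwd = Mamba(d_model=d_model, d_state=d_state, d_conv=d_conv, expand=expand)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        y_fwd = self.fwd(x)
        y_bwd = self.bwd(torch.flip(x, dims=[1]))  # run Mamba over the reversed sequence
        y_bwd = torch.flip(y_bwd, dims=[1])        # realign to the original time order
        return self.norm(x + y_fwd + y_bwd)        # residual connection + fused directions
```

Usage on GPU would look like `y = ExtBiMamba(d_model=256).cuda()(torch.randn(2, 200, 256, device="cuda"))`. An InnBiMamba block would instead keep a single set of input/output projections and run only the inner SSM branch in both directions.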
The study evaluates Mamba and BiMamba in speech enhancement and recognition tasks using various datasets. Results indicate that BiMamba achieves superior performance compared to self-attention mechanisms in both tasks. In speech enhancement, BiMamba models outperform the baseline models on metrics such as PESQ, ESTOI, CSIG, CBAK, and COVL. In speech recognition, BiMamba models show significant improvements in word error rate (WER) and mixed word error rate (MER) compared with Transformer and Conformer models.
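For readers reproducing these evaluations, the main objective metrics can be computed with common open-source packages. A minimal sketch, assuming the pesq, pystoi, jiwer, and soundfile packages and hypothetical file names; CSIG, CBAK, and COVL are composite measures usually derived from PESQ and related scores and are omitted here.

```python
import soundfile as sf   # pip install soundfile
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi
import jiwer             # pip install jiwer

# Hypothetical file names; substitute a real clean/enhanced pair at 16 kHz mono.
clean, fs = sf.read("clean.wav")
enhanced, _ = sf.read("enhanced.wav")

pesq_wb = pesq(fs, clean, enhanced, 'wb')          # wide-band PESQ, roughly [-0.5, 4.5]
estoi = stoi(clean, enhanced, fs, extended=True)   # ESTOI intelligibility, in [0, 1]

# Word error rate for the recognition task (reference vs. hypothesis transcripts)
wer = jiwer.wer("reference transcript here", "hypothesis transcript here")

print(f"PESQ: {pesq_wb:.2f}  ESTOI: {estoi:.3f}  WER: {wer:.3f}")
```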
The paper also discusses the effectiveness of BiMamba in capturing high-level semantic information through additional nonlinearity, making it a suitable replacement for self-attention in speech processing. The study highlights the importance of bidirectional modeling in speech tasks, demonstrating that BiMamba provides better performance than unidirectional models. The results suggest that BiMamba is a promising alternative to self-attention in speech processing, particularly for tasks requiring high-level semantic information.