AN INVESTIGATION OF INCORPORATING MAMBA FOR SPEECH ENHANCEMENT


10 May 2024 | Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao
This paper investigates the application of Mamba, a scalable state-space model, to speech enhancement (SE). The authors propose SEMamba, a Mamba-based speech enhancement system, and evaluate it on the VoiceBank-DEMAND dataset. SEMamba demonstrates promising results, achieving a PESQ score of 3.55 without perceptual contrast stretching (PCS) and 3.69 with PCS, a state-of-the-art result on this dataset. The system is evaluated in both a basic and an advanced configuration.

The basic architecture, SEMamba-basic, is a causal model that processes the input through a convolutional encoder, a uni-directional Mamba block, and a fully connected decoder. The advanced architecture, SEMamba-advanced, integrates components from MP-SENet and uses a Time-Frequency Mamba block to enhance spectral properties. The system also incorporates a consistency loss (CL) to improve training stability and PCS to enhance perceptual quality. The experiments show that Mamba achieves comparable or superior performance to Transformer-based models with fewer FLOPs and parameters. The results indicate that Mamba benefits most from the advanced architecture, achieving higher PESQ scores at lower computational cost. The study concludes that Mamba holds significant promise for advancing SE performance and explores its potential in other speech generation tasks.
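The causal, uni-directional Mamba block mentioned above builds on the classic linear state-space recurrence, h_t = A·h_{t-1} + B·x_t, y_t = C·h_t. The following NumPy sketch is illustrative only, not the SEMamba implementation: it uses fixed A, B, C matrices and a sequential loop, whereas Mamba makes these parameters input-dependent ("selective") and computes the recurrence with a hardware-efficient parallel scan.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete linear state-space model over a sequence.

    x: (T, d_in) input sequence  ->  returns (T, d_out) outputs.
    A: (n, n) state transition, B: (n, d_in), C: (d_out, n).
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])            # initial hidden state
    ys = []
    for t in range(T):                  # causal, uni-directional pass
        h = A @ h + B @ x[t]            # state update
        ys.append(C @ h)                # readout
    return np.stack(ys)

# Tiny demo: 1-D input, 2-D state, 1-D output (values are arbitrary).
A = np.array([[0.9, 0.0], [0.1, 0.8]])  # stable state transition
B = np.array([[1.0], [0.0]])
C = np.array([[0.5, 0.5]])
y = ssm_scan(np.ones((5, 1)), A, B, C)
print(y.shape)  # (5, 1)
```

Because each output depends only on past inputs, a block like this is directly usable in the causal SEMamba-basic setting; the bidirectional Time-Frequency variant in SEMamba-advanced would instead combine forward and backward passes over the sequence.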