Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers


2024 | Changsheng Quan, Xiaofei Li
This paper proposes an online SpatialNet for long-term streaming speech enhancement in both static and moving speaker scenarios. The method extends the previously proposed offline SpatialNet, which exploits spatial information to discriminate target speech from interference; its core is a narrow-band self-attention module that learns the temporal dynamics of spatial vectors. For long-term streaming, the offline self-attention network is replaced with online networks that have linear complexity with respect to signal length while retaining the ability to learn long-term information. Three variants are developed: masked self-attention, Retention, and Mamba. In addition, a short-signal training plus long-signal fine-tuning (ST+LF) strategy is proposed to improve length extrapolation.

Experiments show that the proposed online SpatialNet outperforms baseline methods on long audio streams for both static and moving speakers, that the ST+LF strategy is efficient in terms of both training cost and performance, and that the Mamba variant performs best among the three. The paper concludes that the online SpatialNet, especially its Mamba variant, achieves outstanding long-term streaming speech enhancement for both static and moving speakers. The code is open-sourced at https://github.com/Audio-WestlakeU/NBSS.
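As a rough illustration of the masked self-attention variant, the sketch below (a minimal assumption-laden example, not the authors' implementation; see the linked NBSS repository) applies causal self-attention over time independently to each frequency bin, so every narrow-band sequence attends only to past frames. The layer sizes, single attention block, residual/normalization layout, and input shape are all illustrative assumptions.

```python
# Minimal sketch of a causal narrow-band self-attention block (illustrative only;
# the actual online SpatialNet is in the NBSS repository). Each frequency bin is
# treated as its own sequence and may only attend to current and past frames.
import torch
import torch.nn as nn


class CausalNarrowBandAttention(nn.Module):
    def __init__(self, dim: int = 96, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, freq, time, dim) -- hidden features per time-frequency bin
        b, f, t, d = x.shape
        x = x.reshape(b * f, t, d)  # treat each frequency bin as an independent sequence
        # Boolean mask: True entries are disallowed, i.e. future frames are masked out
        causal = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        y, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        y = self.norm(x + y)  # residual connection + layer norm
        return y.reshape(b, f, t, d)


if __name__ == "__main__":
    feats = torch.randn(2, 129, 200, 96)  # (batch, freq bins, frames, hidden dim)
    out = CausalNarrowBandAttention()(feats)
    print(out.shape)  # torch.Size([2, 129, 200, 96])
```

The Retention and Mamba variants replace this attention with recurrent-style operators whose cost grows linearly with signal length, which is what makes long-term streaming practical.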