This paper extends the offline SpatialNet to long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. The core of SpatialNet is a narrow-band self-attention module that learns the temporal dynamics of spatial vectors. To enable online processing, the offline self-attention network is replaced with one of three causal variants: masked self-attention (MSA), Retention, and Mamba, each of which preserves long-term information learning while offering linear inference complexity. The authors also propose a short-signal training plus long-signal fine-tuning strategy to improve length-extrapolation ability. The proposed online SpatialNet achieves outstanding speech enhancement performance on long audio streams for both static and moving speakers. The method is open-sourced and evaluated on simulated datasets with static and moving speakers, demonstrating superior performance compared to baseline methods.
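The key idea behind the MSA variant, making self-attention causal so each time frame attends only to past frames, can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, projection setup, and single-head structure are simplifications, and the narrow-band (per-frequency) processing of SpatialNet is omitted:

```python
import numpy as np

def causal_self_attention(x, wq, wk, wv):
    """Masked (causal) self-attention over a sequence of time frames.

    x: (T, d) sequence of frame features; wq, wk, wv: (d, d) projections.
    Frame t attends only to frames <= t, so outputs for past frames never
    change when new frames arrive -- the property that enables streaming.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    T, d = x.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions (strict upper triangle) before softmax.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf
    # Numerically stable softmax over the allowed (past) positions.
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because the attention weights for frame t depend only on frames up to t, appending new audio leaves all previous outputs unchanged, which is what makes online inference possible (Retention and Mamba achieve the same causality with recurrent formulations that avoid the quadratic attention cost).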