Time domain speech enhancement with CNN and time-attention transformer


2024 | Saleem, N., Gunawan, T.S., Dhahbi, S., Bourouis, S.
This article presents a novel approach for time-domain speech enhancement using a combination of convolutional encoder-decoder networks and a time-attention transformer (TAT). The proposed model employs 1D time-domain dilated residual blocks in the encoder-decoder framework to capture contextual information and detailed features of the input speech. Additionally, a TAT bottleneck is integrated to enable the model to selectively attend to different segments of the speech signal over time, thereby capturing long-term dependencies and important features. The experimental results demonstrate that the proposed model outperforms recent deep neural networks (DNNs) in enhancing the quality and intelligibility of noisy speech. Using the WSJ0 SI-84 database, the model improves STOI by 21.51% and PESQ by 1.14 compared to noisy speech. The study highlights the effectiveness of combining convolutional networks with attention mechanisms in time-domain speech enhancement, offering a resource-efficient solution that maintains high performance. The research contributes to the field by addressing the challenge of maintaining contextual information while achieving efficient processing in the time domain.
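
To make the described architecture concrete, the sketch below shows one way such a model could be organized in PyTorch: a 1D convolutional encoder built from dilated residual blocks, a transformer bottleneck that self-attends across time frames, and a mirrored convolutional decoder that maps back to the waveform. All hyperparameters here (channel counts, kernel sizes, strides, dilation rates, number of attention heads and layers) are illustrative assumptions and do not reflect the authors' exact configuration.

```python
# Minimal sketch of a time-domain enhancer with a CNN encoder-decoder and a
# time-attention transformer bottleneck. Layer sizes are assumptions.
import torch
import torch.nn as nn


class DilatedResidualBlock(nn.Module):
    """1D dilated convolution with a residual connection."""

    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # keep the time length unchanged
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.norm(self.conv(x)))


class TimeAttentionBottleneck(nn.Module):
    """Self-attention over the time axis of the encoded feature map."""

    def __init__(self, channels: int, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=n_heads, dim_feedforward=4 * channels,
            batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, channels, time) -> attend over time -> restore layout
        x = x.transpose(1, 2)            # (batch, time, channels)
        x = self.transformer(x)
        return x.transpose(1, 2)         # (batch, channels, time)


class TimeDomainEnhancer(nn.Module):
    """Noisy waveform in, enhanced waveform out."""

    def __init__(self, channels: int = 64, stride: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=16, stride=stride, padding=6),
            DilatedResidualBlock(channels, dilation=1),
            DilatedResidualBlock(channels, dilation=2),
            DilatedResidualBlock(channels, dilation=4),
        )
        self.bottleneck = TimeAttentionBottleneck(channels)
        self.decoder = nn.Sequential(
            DilatedResidualBlock(channels, dilation=4),
            DilatedResidualBlock(channels, dilation=2),
            DilatedResidualBlock(channels, dilation=1),
            nn.ConvTranspose1d(channels, 1, kernel_size=16,
                               stride=stride, padding=6),
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, 1, samples)
        z = self.encoder(noisy)
        z = self.bottleneck(z)
        return self.decoder(z)


if __name__ == "__main__":
    model = TimeDomainEnhancer()
    noisy = torch.randn(2, 1, 16000)     # two 1-second clips at 16 kHz
    enhanced = model(noisy)
    print(enhanced.shape)                # torch.Size([2, 1, 16000])
```

The dilated convolutions widen the receptive field of the encoder and decoder without downsampling further, while the transformer bottleneck is what lets distant time frames influence each other directly, matching the paper's motivation of capturing long-term dependencies efficiently in the time domain.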