Time domain speech enhancement with CNN and time-attention transformer

2024 | Saleem, N., Gunawan, T.S., Dhahbi, S., Bourouis, S.
This paper proposes a speech enhancement model built on a convolutional encoder-decoder framework with 1-D time-domain dilated residual blocks and a time-attention transformer (TAT) bottleneck. The time-attention mechanism selectively attends to different segments of the speech signal over time, allowing the model to capture long-term dependencies and learn the most informative features.

The model is evaluated on the WSJ0 SI-84 database and shows significant improvements in speech intelligibility and quality over recent deep neural networks (DNNs): it improves STOI by 21.51% and PESQ by 1.14 relative to noisy speech. The study highlights the effectiveness of deep learning in time-domain speech enhancement, particularly in preserving quality and intelligibility in noisy environments. The paper also surveys prior work on speech enhancement, from traditional methods such as spectral subtraction to modern deep learning techniques including generative adversarial networks (GANs), transformers, and attention mechanisms, illustrating ongoing progress in both algorithmic innovation and practical application. The results indicate that the proposed model outperforms existing methods, offering a promising solution for time-domain speech enhancement.
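The core of the TAT bottleneck described above is attention applied across time frames of the signal. The abstract does not give the paper's exact formulation, so the following is a minimal NumPy sketch of generic scaled dot-product attention over time frames; the random projection matrices (`Wq`, `Wk`, `Wv`) are hypothetical stand-ins for the learned layers in the actual model.

```python
import numpy as np

def time_attention(x, d_k=16, seed=0):
    """Scaled dot-product attention across time frames.

    x    : (T, d) array of per-frame features of a speech signal.
    d_k  : projection dimension (illustrative choice, not from the paper).
    Random Q/K/V projections stand in for learned weights.
    Returns a (T, d_k) array where each output frame is a weighted
    mixture of all input frames, so distant frames can influence
    each other (long-term dependencies).
    """
    rng = np.random.default_rng(seed)
    T, d = x.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d_k)             # (T, T) frame-to-frame affinities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the time axis
    return w @ V                                # (T, d_k) attended features
```

Because the softmax is taken over the time axis, every output frame can draw on the entire utterance rather than a fixed local window, which is the property the abstract credits for capturing long-term dependencies.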