2018 October; 26(10): 1702–1726. doi:10.1109/TASLP.2018.2842159. | DeLiang Wang [Fellow, IEEE] and Jitong Chen
This paper provides a comprehensive overview of deep learning-based supervised speech separation, the task of separating target speech from background interference. The authors introduce the background of speech separation and the formulation of supervised separation, and discuss its three main components: learning machines, training targets, and acoustic features. They review monaural methods, including speech enhancement, speaker separation, and speech dereverberation, as well as multimicrophone techniques. The paper also addresses generalization, an issue unique to supervised learning. The authors highlight the advances that deep learning has brought to speech separation, particularly in performance and computational efficiency. They discuss various training targets, such as ideal binary masks, target binary masks, and spectral magnitude masks, and evaluate their effectiveness using metrics such as STOI and PESQ. They also examine the features used in supervised speech separation, emphasizing the importance of feature selection. The paper concludes with a discussion of the generalization of speech enhancement algorithms and the potential of end-to-end separation approaches.
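To make the idea of a masking-based training target concrete, the following is a minimal sketch of two common targets computed from time-frequency power values: the ideal binary mask and, as a soft counterpart not named in the summary above, the ideal ratio mask. The function names, the 0 dB local criterion, the exponent of 0.5, and the toy 2x2 inputs are illustrative assumptions, not values taken from the paper's experiments.

```python
import numpy as np

# Hypothetical helper names; the inputs are assumed to be power
# spectrograms (frequency x time) of the premixed speech and noise.

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Return 1 where the local SNR exceeds lc_db, else 0."""
    eps = np.finfo(float).eps  # avoids division by zero and log(0)
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(float)

def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Return the soft mask (S / (S + N)) ** beta, with values in [0, 1]."""
    eps = np.finfo(float).eps
    return (speech_power / (speech_power + noise_power + eps)) ** beta

# Toy example: two frequency channels by two time frames.
speech_power = np.array([[4.0, 1.0], [0.0, 9.0]])
noise_power = np.array([[1.0, 4.0], [1.0, 1.0]])

ibm = ideal_binary_mask(speech_power, noise_power)
irm = ideal_ratio_mask(speech_power, noise_power)
print(ibm)
print(irm)
```

In a supervised setup, masks like these would serve as the labels a learning machine is trained to predict from features of the noisy mixture; at test time the estimated mask is applied to the mixture's time-frequency representation.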