On Training Targets for Supervised Speech Separation

2014 December; 22(12): 1849–1858 | Yuxuan Wang, Arun Narayanan, DeLiang Wang
This paper evaluates and compares training targets for supervised speech separation: the Ideal Binary Mask (IBM), Target Binary Mask (TBM), Ideal Ratio Mask (IRM), Short-Time Fourier Transform Spectral Magnitude (FFT-MAG), Short-Time Fourier Transform Spectral Magnitude Mask (FFT-MASK), and Gammatone Frequency Power Spectrum (GF-POW). The study trains deep neural networks (DNNs) to estimate each target and evaluates the separated speech under various test conditions using objective intelligibility and quality metrics. The results show that the ratio mask targets (IRM and FFT-MASK) outperform the other targets on both objective intelligibility and quality. In addition, masking-based targets generally perform better than spectral envelope-based targets.
The paper also compares supervised speech separation with non-negative matrix factorization (NMF) and with speech enhancement methods, and finds that the supervised techniques perform better. The study concludes by stressing the importance of choosing an appropriate training target and suggests future research directions, such as designing new training targets to improve performance further.
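To make the difference between a binary mask and a ratio mask concrete, here is a minimal sketch of how the IBM and IRM are commonly computed from the premixed speech and noise energies in each time-frequency unit. The summary above does not give the formulas, so everything here is an assumption for illustration: the function names, the local criterion `lc_db`, the exponent `beta`, and the toy energy values are all hypothetical choices, not values confirmed by this page.

```python
import numpy as np


def ideal_binary_mask(speech_pow, noise_pow, lc_db=-5.0, eps=1e-12):
    """Return 1 where the local SNR exceeds the local criterion, else 0.

    lc_db is an assumed threshold in dB, chosen only for illustration.
    """
    snr_db = 10.0 * np.log10((speech_pow + eps) / (noise_pow + eps))
    return (snr_db > lc_db).astype(np.float32)


def ideal_ratio_mask(speech_pow, noise_pow, beta=0.5, eps=1e-12):
    """Return a soft mask in [0, 1]: the speech share of the energy, raised to beta.

    beta=0.5 is an assumed exponent (square-root compression).
    """
    return (speech_pow / (speech_pow + noise_pow + eps)) ** beta


# Toy 2x2 grid of time-frequency energies (hypothetical values).
speech_pow = np.array([[4.0, 0.01], [1.0, 0.0]])
noise_pow = np.array([[1.0, 1.0], [1.0, 2.0]])

ibm = ideal_binary_mask(speech_pow, noise_pow)
irm = ideal_ratio_mask(speech_pow, noise_pow)
print(ibm)
print(irm)
```

The binary mask makes a hard keep-or-discard decision per unit, while the ratio mask keeps a graded amount of energy from each unit. That difference in definition is the distinction behind the comparison the summary reports between binary and ratio mask targets.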