On Training Targets for Supervised Speech Separation


December 2014 | Yuxuan Wang, Arun Narayanan, DeLiang Wang (Fellow, IEEE)
This paper investigates the effectiveness of various training targets for supervised speech separation. It evaluates and compares the ideal binary mask (IBM), the target binary mask (TBM), the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude (FFT-MAG) and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. The separation framework is based on deep neural networks (DNNs) with three hidden layers of 1024 rectified linear units each, trained to predict the desired target across all frequency bands. Performance is measured with objective intelligibility and quality metrics, chiefly the Short-Time Objective Intelligibility (STOI) score and the Perceptual Evaluation of Speech Quality (PESQ) score.

The two ratio-mask targets, IRM and FFT-MASK, outperform the other targets on both metrics, and masking-based targets prove significantly better than spectral-envelope-based targets. The study also examines how different normalization and compression schemes affect FFT-MAG prediction, finding that masking functions such as FFT-MASK outperform log-magnitude compression: errors made in the log-magnitude domain are magnified when converted back to the magnitude domain before resynthesis, which degrades performance.
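To make the masking-based targets concrete, the IBM and IRM can be computed from premixed speech and noise power spectrograms roughly as follows. This is a minimal NumPy sketch: the local-SNR threshold `lc_db` and the exponent of 0.5 on the IRM are common choices assumed here, not details taken from this summary.

```python
import numpy as np

def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds a threshold (lc_db), else 0.
    speech_pow / noise_pow: time-frequency power spectrograms of the
    premixed speech and noise (same shape)."""
    eps = 1e-12
    snr_db = 10.0 * np.log10((speech_pow + eps) / (noise_pow + eps))
    return (snr_db > lc_db).astype(np.float64)

def ideal_ratio_mask(speech_pow, noise_pow, beta=0.5):
    """IRM: soft mask in [0, 1]; beta=0.5 is a commonly used exponent
    (an assumption here, not stated in this summary)."""
    return (speech_pow / (speech_pow + noise_pow + 1e-12)) ** beta

# Toy example: one T-F unit dominated by speech, one by noise.
s = np.array([[4.0, 0.1]])
n = np.array([[1.0, 1.0]])
print(ideal_binary_mask(s, n))  # speech-dominant unit -> 1, noise-dominant -> 0
print(ideal_ratio_mask(s, n))   # soft values between 0 and 1
```

Unlike the hard 0/1 decision of the IBM, the IRM assigns each time-frequency unit a value that scales with the speech proportion of its energy, which is one intuition for why ratio masking preserves more speech quality.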
Supervised speech separation is also compared against recent methods in non-negative matrix factorization and speech enhancement, showing clear performance advantages over both. The paper concludes that choosing an appropriate training target is critical for supervised learning, since the target directly embodies the underlying computational goal: ratio-masking targets such as the IRM and FFT-MASK outperform binary-masking targets such as the IBM and TBM in objective intelligibility and quality, and masking-based targets hold clear advantages over spectral-envelope-based ones. The results suggest that supervised speech separation has significant potential for improving speech intelligibility and quality in noisy environments.
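The DNN described in the paper, three hidden layers of 1024 rectified linear units predicting the target across all frequency bands, can be sketched as a plain feedforward pass. Everything here besides the hidden-layer shape is an assumption: the input and output dimensions, the weight initialization, and the sigmoid output (chosen so predictions land in [0, 1] like a ratio mask) are illustrative, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: a 256-dim acoustic feature vector in, one mask value
# per frequency channel out (64 channels is an assumption).
n_in, n_hidden, n_out = 256, 1024, 64

# Three hidden layers of 1024 rectified linear units, as described above.
sizes = [n_in, n_hidden, n_hidden, n_hidden, n_out]
weights = [rng.normal(0.0, 0.01, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    """Forward pass: ReLU hidden layers, sigmoid output so that each
    prediction lies in [0, 1] like a ratio mask."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return sigmoid(h @ weights[-1] + biases[-1])

features = rng.normal(size=n_in)
mask = forward(features)
print(mask.shape)  # one mask estimate per frequency channel
```

In practice such a network is trained by minimizing the squared error between its output and the chosen target (e.g. the IRM) over many noisy training mixtures; the choice of target is exactly what the paper varies.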
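The observation that log-compression errors are magnified on conversion back to the magnitude domain can be illustrated with a toy calculation: the same additive error in the log domain becomes a multiplicative error after exponentiation, so its absolute cost grows with the true magnitude. This is a hedged numerical demonstration, not the paper's analysis.

```python
import numpy as np

# A fixed additive error in the log-magnitude domain...
log_error = 0.5

for true_mag in (0.1, 1.0, 10.0):
    predicted_log = np.log(true_mag) + log_error
    predicted_mag = np.exp(predicted_log)   # convert back before resynthesis
    abs_error = predicted_mag - true_mag    # ...grows with the true magnitude
    print(f"true={true_mag:5.1f}  reconstructed={predicted_mag:7.3f}  "
          f"absolute error={abs_error:6.3f}")
```

The identical 0.5 error in the log domain produces an absolute magnitude error 100 times larger at magnitude 10.0 than at 0.1, which is one way to see why predicting a bounded mask can be more forgiving than predicting a log-compressed magnitude.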