Deep clustering: Discriminative embeddings for segmentation and separation

18 Aug 2015 | John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe
This paper introduces deep clustering, an approach to acoustic source separation in which a deep network is trained to produce spectrogram embeddings that are discriminative for the partition labels given in the training data. Unlike previous methods that directly estimate signals or masking functions, this approach produces embeddings that are clustered to infer the partition labels. The objective function trains the embeddings to yield a low-rank approximation of an ideal pairwise affinity matrix, which avoids the high computational cost of spectral factorization and produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore encoded implicitly in the embeddings and can be decoded by clustering.

The method is tested on speech separation tasks, where it separates speech from mixtures of two or three speakers even when trained only on two-speaker mixtures. The experiments report that it outperforms traditional spectral clustering and other approaches in terms of signal-to-distortion ratio (SDR) improvement. The model generalizes well to novel sources, and the results indicate that it can achieve class-independent segmentation of arbitrary sounds, with potential applications in other domains such as image segmentation. The paper also discusses the training procedure, the effect of different embedding dimensions and activation functions, and the performance of the model on various types of mixtures.
The method is also shown to generalize well to three-speaker mixtures, demonstrating its potential for real-world applications.
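
The summary describes an objective that trains embeddings to approximate a low-rank ideal pairwise affinity matrix, without paying for a full spectral factorization. The summary does not spell the formula out, so the following is only a minimal sketch under stated assumptions: that the objective is the squared Frobenius distance between the embedding affinity `V @ V.T` and the ideal affinity `Y @ Y.T`, where `V` (N x D) holds one embedding per time-frequency bin and `Y` (N x C) is a one-hot matrix of partition labels. The function and variable names are hypothetical and not taken from the paper's code. The sketch shows why the low-rank structure matters: the same value can be computed from small D x D, D x C and C x C products, so the N x N affinity matrices never need to be formed.

```python
import numpy as np


def affinity_loss_explicit(V, Y):
    """Squared Frobenius distance between V V^T and Y Y^T.

    Forms both N x N affinity matrices explicitly; only practical for small N.
    """
    diff = V @ V.T - Y @ Y.T
    return float(np.sum(diff ** 2))


def affinity_loss_low_rank(V, Y):
    """Same quantity, expanded so that only small matrices are formed.

    Uses the identity
        ||V V^T - Y Y^T||_F^2 = ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2,
    where V^T V is D x D, V^T Y is D x C and Y^T Y is C x C.
    """
    vtv = V.T @ V
    vty = V.T @ Y
    yty = Y.T @ Y
    return float(np.sum(vtv ** 2) - 2.0 * np.sum(vty ** 2) + np.sum(yty ** 2))


# Toy example (assumed sizes): 6 time-frequency bins, 3-dim embeddings, 2 sources.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 0, 1])      # hypothetical partition labels
Y = np.eye(2)[labels]                      # one-hot label matrix, shape (6, 2)
V = rng.normal(size=(6, 3))                # stand-in for network outputs
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm embeddings (an assumption)

print(affinity_loss_explicit(V, Y))
print(affinity_loss_low_rank(V, Y))
```

The two printed values should agree up to floating-point error. Embeddings that reproduce the ideal affinity exactly (for instance `V = Y`) give a loss of zero, which is the sense in which the embeddings encode the segmentation. Decoding would then apply a simple clustering method, for example k-means, to the rows of `V`.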