18 Aug 2015 | John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe
The paper introduces a deep learning framework called "deep clustering" for acoustic source separation. Instead of directly estimating signals or masking functions, deep clustering trains a deep network to produce spectrogram embeddings that are discriminative for partition labels. The approach keeps the learning power and speed of deep networks while avoiding the computational expense of spectral clustering and the inflexibility of class-based methods, which are tied to a fixed set of classes and a known number of sources.
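To make the setup concrete, here is a minimal sketch of such an embedding network in PyTorch. It assumes a BLSTM over spectrogram frames, in the spirit of the architecture described in the paper; the class name `EmbeddingNet`, the layer sizes, and the default dimensions are illustrative assumptions, not the authors' exact configuration. The network maps a log-magnitude spectrogram to one unit-norm D-dimensional embedding per time-frequency bin.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingNet(nn.Module):
    """Hypothetical embedding network: spectrogram -> one unit-norm
    D-dimensional embedding per time-frequency (T-F) bin."""

    def __init__(self, n_freq=129, emb_dim=40, hidden=300):
        super().__init__()
        # Bidirectional LSTM over time; each input step is one spectrogram frame.
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Project each frame to an embedding for every frequency bin.
        self.proj = nn.Linear(2 * hidden, n_freq * emb_dim)
        self.emb_dim = emb_dim

    def forward(self, spec):            # spec: (batch, frames, n_freq)
        h, _ = self.blstm(spec)         # (batch, frames, 2 * hidden)
        v = self.proj(h)                # (batch, frames, n_freq * emb_dim)
        v = v.view(*spec.shape, self.emb_dim)
        return F.normalize(v, dim=-1)   # unit-norm embedding per T-F bin
```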
The proposed method uses an objective function that trains embeddings to yield a low-rank approximation of an ideal pairwise affinity matrix, which makes training class-independent and efficient. The segmentations are implicitly encoded in the embeddings and can be decoded using simple clustering methods. Preliminary experiments show that the method can effectively separate speech from mixtures of two speakers, improving signal quality by about 6 dB. The model also generalizes to three-speaker mixtures despite being trained only on two-speaker mixtures, demonstrating its potential for class-independent segmentation of arbitrary sounds.
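A minimal sketch of this affinity-matching objective is shown below, assuming `V` holds unit-norm embeddings for the N time-frequency bins and `Y` holds one-hot labels of each bin's dominant source. Expanding the Frobenius norm lets the loss be computed from small Gram matrices, so the full N x N affinity matrix is never formed.

```python
import torch


def deep_clustering_loss(V, Y):
    """Affinity-matching objective |V V^T - Y Y^T|_F^2, expanded so that
    only small Gram matrices are formed, never the N x N affinity matrix.

    V: (N, D) unit-norm embeddings, one row per time-frequency bin
    Y: (N, C) one-hot assignment of each bin to its dominant source
    """
    VtV = V.t() @ V   # (D, D)
    VtY = V.t() @ Y   # (D, C)
    YtY = Y.t() @ Y   # (C, C)
    return VtV.pow(2).sum() - 2 * VtY.pow(2).sum() + YtY.pow(2).sum()
```

Because only D x D, D x C, and C x C matrices appear, the cost grows linearly in the number of time-frequency bins rather than quadratically, which is what makes the low-rank formulation efficient.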
The paper discusses the motivation behind the partition-based approach, the challenges of class-based approaches, and the advantages of spectral clustering. It also presents the experimental setup, including the dataset and training procedure, and evaluates the performance of the proposed method using various clustering methods and embedding dimensions. The results show that the deep clustering method outperforms existing methods, particularly in open speaker scenarios, and suggests future directions for further improvement and application to other domains such as image segmentation.
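As an illustration of how the learned embeddings are decoded into separated sources, the sketch below clusters the per-bin embeddings with k-means, one of the simple clustering methods referred to above, and builds binary masks. The function name, variable names, and the two-source default are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans


def embeddings_to_masks(V, shape, n_sources=2):
    """Cluster per-bin embeddings with k-means and return one binary
    mask per source.

    V: (N, D) array of embeddings, one row per time-frequency bin
    shape: (frames, n_freq) layout of the spectrogram the bins came from
    """
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(V)
    labels = labels.reshape(shape)                      # (frames, n_freq)
    return [(labels == k).astype(np.float32) for k in range(n_sources)]


# Usage sketch: apply each mask to the mixture STFT, then invert to audio.
# masks = embeddings_to_masks(V, mixture_spec.shape)
# estimates = [mask * mixture_stft for mask in masks]
```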