Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

15 May 2019 | Yi Luo, Nima Mesgarani
Conv-TasNet is a deep learning framework for end-to-end time-domain speech separation that addresses the limitations of traditional time-frequency (T-F) methods. The fully convolutional time-domain audio separation network (Conv-TasNet) directly estimates each speaker's waveform from the mixture waveform, avoiding the decoupling of the signal into magnitude and phase. A linear encoder generates a representation of the mixture optimized for separating individual speakers, a temporal convolutional network (TCN) estimates a mask for each speaker, the masks are applied to the encoder output, and the modified representations are inverted back to waveforms by a linear decoder. Conv-TasNet outperforms previous T-F masking methods on two- and three-speaker mixtures and surpasses the ideal time-frequency magnitude mask in both objective and subjective evaluations. It also has a smaller model size and a shorter minimum latency, making it suitable for real-time and low-latency applications. The paper details the architecture, experimental procedures, and results, highlighting the advantages of Conv-TasNet over existing methods.
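To make the encoder-TCN-decoder pipeline concrete, here is a minimal PyTorch sketch of that structure: a 1-D convolutional encoder, a simplified stack of dilated depthwise-separable convolution blocks that predicts one mask per speaker, and a transposed-convolution decoder. The hyperparameter values (N, L, H, num_blocks), the sigmoid mask nonlinearity, and the omission of normalization and skip-connection paths are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of the Conv-TasNet pipeline (encoder -> TCN masks -> decoder).
# Hyperparameters and the simplified TCN block are illustrative assumptions,
# not the configuration reported in the paper.
import torch
import torch.nn as nn


class DepthwiseTCNBlock(nn.Module):
    """One dilated, depthwise-separable 1-D conv block with a residual connection."""

    def __init__(self, channels, hidden, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),               # 1x1 bottleneck
            nn.PReLU(),
            nn.Conv1d(hidden, hidden, kernel_size,
                      padding=pad, dilation=dilation,
                      groups=hidden),                      # depthwise dilated conv
            nn.PReLU(),
            nn.Conv1d(hidden, channels, 1),               # 1x1 back to channel count
        )

    def forward(self, x):
        return x + self.net(x)                            # residual connection


class TinyConvTasNet(nn.Module):
    def __init__(self, num_speakers=2, N=64, L=16, H=128, num_blocks=4):
        super().__init__()
        self.num_speakers, self.N = num_speakers, N
        # Linear encoder: strided 1-D conv maps the waveform to N-dimensional frames.
        self.encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
        # Mask estimator: dilated TCN blocks followed by a 1x1 conv per speaker.
        self.tcn = nn.Sequential(
            *[DepthwiseTCNBlock(N, H, dilation=2 ** i) for i in range(num_blocks)]
        )
        self.mask_conv = nn.Conv1d(N, num_speakers * N, kernel_size=1)
        # Linear decoder: transposed conv inverts masked representations to waveforms.
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

    def forward(self, mixture):                           # mixture: (batch, samples)
        w = self.encoder(mixture.unsqueeze(1))            # (batch, N, frames)
        masks = torch.sigmoid(self.mask_conv(self.tcn(w)))
        masks = masks.view(-1, self.num_speakers, self.N, w.shape[-1])
        masked = masks * w.unsqueeze(1)                   # one masked copy per speaker
        est = self.decoder(masked.view(-1, self.N, w.shape[-1]))
        return est.view(mixture.shape[0], self.num_speakers, -1)


if __name__ == "__main__":
    model = TinyConvTasNet()
    mix = torch.randn(2, 16000)                           # two 1-second mixtures at 16 kHz
    print(model(mix).shape)                               # (2, num_speakers, samples)
```

In the actual system the entire stack is trained end-to-end on a waveform-level separation objective, so the encoder basis and the masks are learned jointly rather than fixed like an STFT.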