WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

17 Jun 2022 | Sanyuan Chen*, Chengyi Wang*, Zhengyang Chen*, Yu Wu*, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei
WavLM is a large-scale self-supervised pre-trained model designed for full-stack speech processing tasks. During pre-training it jointly learns masked speech prediction and denoising, which enables it to model both spoken content and the information required by non-ASR tasks. The Transformer structure is equipped with a gated relative position bias to better capture the sequence ordering of the input speech.

The model is trained on 94k hours of public audio data: 60k hours of Libri-Light, 10k hours of GigaSpeech, and 24k hours of VoxPopuli. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, outperforming HuBERT Large on 14 subtasks with a 2.4-point gain in the overall evaluation, and it brings significant improvements to speaker verification, speech separation, and speaker diarization. Both the model structure and the training data have been optimized, and the pre-trained models are released to facilitate future research.

The masked speech denoising and prediction framework allows WavLM to handle non-ASR tasks by implicitly modeling the information needed for speaker identification, separation, and diarization. Performance is evaluated on nineteen subtasks: fifteen from SUPERB and four classic speech tasks. Already WavLM Base+ achieves better results than HuBERT Large thanks to three modifications, and performance improves further with a larger, more diverse dataset and longer training steps.
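To make the denoising objective concrete, below is a minimal sketch of the kind of input simulation it relies on: part of a training utterance is overlapped with noise or with another utterance at a random energy ratio, while the prediction targets remain the pseudo-labels of the original clean speech, so the model must learn to denoise implicitly. The function name, SNR range, and overlap ratio here are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def simulate_noisy_input(primary: torch.Tensor,
                         interference: torch.Tensor,
                         snr_db_range: tuple = (-5.0, 5.0),
                         max_overlap_ratio: float = 0.5) -> torch.Tensor:
    """Overlay a random crop of an interfering signal (noise or another
    utterance) onto part of the primary waveform.

    Pre-training still targets the pseudo-labels of the *clean* primary
    utterance, which forces the model to denoise implicitly.
    """
    mix = primary.clone()
    # Decide how much of the primary utterance gets overlapped.
    overlap_len = int(len(primary) *
                      torch.empty(1).uniform_(0.0, max_overlap_ratio).item())
    if overlap_len == 0 or len(interference) < overlap_len:
        return mix
    p_start = torch.randint(0, len(primary) - overlap_len + 1, (1,)).item()
    i_start = torch.randint(0, len(interference) - overlap_len + 1, (1,)).item()
    chunk = interference[i_start:i_start + overlap_len]
    # Scale the interference so the mix hits a random signal-to-noise ratio.
    snr_db = torch.empty(1).uniform_(*snr_db_range).item()
    p_energy = primary.pow(2).mean().clamp_min(1e-8)
    c_energy = chunk.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(p_energy / (c_energy * 10.0 ** (snr_db / 10.0)))
    mix[p_start:p_start + overlap_len] += scale * chunk
    return mix

# Example: overlap one second of "speech" with a second utterance.
speech, other = torch.randn(16000), torch.randn(16000)
noisy = simulate_noisy_input(speech, other)
```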
WavLM's contributions are fourfold: it proposes a general pre-trained model for full-stack speech processing; it modifies existing pre-trained models with generally applicable improvements; it scales up self-supervised pre-training in data size and diversity; and it achieves state-of-the-art results on the SUPERB benchmark. Evaluations on speaker verification, speech separation, and speaker diarization further demonstrate its effectiveness across diverse speech processing tasks.
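Because the checkpoints are public, the pre-trained model is easy to reuse as a frozen feature extractor for downstream tasks. The sketch below assumes the Hugging Face transformers port (WavLMModel) and the microsoft/wavlm-base-plus checkpoint; in SUPERB-style evaluation, a lightweight downstream head is typically trained on a learned weighted sum of the hidden states from all Transformer layers.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

# Load a released checkpoint (assumes the Hugging Face port of WavLM).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

waveform = torch.randn(16000)  # stand-in for 1 s of 16 kHz mono audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

print(out.last_hidden_state.shape)  # (batch, frames, hidden_size)
print(len(out.hidden_states))       # one entry per layer, plus the embeddings
```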