WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing


17 Jun 2022 | Sanyuan Chen*, Chengyi Wang*, Zhengyang Chen*, Yu Wu*, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei
WavLM is a novel pre-trained model designed to address a wide range of speech processing tasks, including speaker verification, automatic speech recognition (ASR), speech separation, and speaker diarization. The model leverages large-scale unlabeled speech data to learn universal representations, which are then fine-tuned for specific downstream tasks. Key contributions include:

1. **Masked Speech Denoising and Prediction**: WavLM extends masked speech prediction with a denoising objective, in which some inputs are simulated as noisy or overlapped speech while the prediction targets remain those of the original clean utterance. This pushes the model to learn not only ASR-related information but also cues that improve non-ASR tasks such as speaker verification and speech separation (a minimal sketch of the utterance-mixing idea follows this list).
2. **Transformer Structure Optimization**: The model employs gated relative position bias to better capture the sequence ordering of the input speech, improving its handling of complex acoustic environments and speaker identity modeling (see the simplified sketch after this list).
3. **Data Scaling**: The training data is expanded from 60k hours to 94k hours by combining diverse sources, namely Libri-Light, GigaSpeech, and VoxPopuli, to improve robustness and generalization.
4. **State-of-the-Art Performance**: WavLM achieves state-of-the-art results on the SUPERB benchmark and brings significant improvements over existing models on various speech processing tasks, including speaker verification, speech separation, and speaker diarization.
5. **Code and Models**: The code and pre-trained models are available at https://aka.ms/wavlm, facilitating further research and development in speech processing (see the loading example below).

The paper also discusses related work, background on HuBERT, and detailed experimental results, demonstrating the effectiveness and versatility of WavLM for full-stack speech processing tasks.
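The paper's exact mixing policy (segment lengths, energy ratios, and the proportion of noisy or overlapped inputs) is not reproduced here; the following is a minimal Python sketch of the utterance-mixing idea from point 1, with the function name `mix_for_denoising` and the parameter ranges chosen purely for illustration. The key property it illustrates is that the pre-training targets stay tied to the primary (clean) utterance, which is what forces the model to denoise while predicting masked content.

```python
import numpy as np

def mix_for_denoising(primary: np.ndarray, interference: np.ndarray,
                      energy_ratio_db: float, rng: np.random.Generator) -> np.ndarray:
    """Overlay an interfering signal (noise or a second utterance) onto a random
    region of the primary utterance at a given primary-to-interference energy
    ratio in dB. The pre-training targets stay those of the primary speech."""
    seg_len = int(rng.integers(1, len(primary) + 1))          # length of the overlapped region
    start = int(rng.integers(0, len(primary) - seg_len + 1))  # where the overlap begins
    interf = np.resize(interference, seg_len)                 # tile/crop interference to fit

    # Scale the interference so primary/interference energy matches the requested ratio.
    p_energy = float(np.sum(primary ** 2)) + 1e-8
    i_energy = float(np.sum(interf ** 2)) + 1e-8
    scale = np.sqrt(p_energy / (i_energy * 10.0 ** (energy_ratio_db / 10.0)))

    mixed = primary.copy()
    mixed[start:start + seg_len] += scale * interf
    return mixed

# Example: overlap a second "utterance" at a ratio drawn uniformly from [-5, 5] dB.
rng = np.random.default_rng(0)
utt_a = rng.standard_normal(16000)  # stand-ins for real 16 kHz waveforms
utt_b = rng.standard_normal(12000)
noisy_input = mix_for_denoising(utt_a, utt_b, float(rng.uniform(-5.0, 5.0)), rng)
```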
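Point 2 refers to the gated relative position bias used in the Transformer encoder. The sketch below is only a simplified illustration of the gating idea, a bucketed relative position bias whose magnitude is modulated by a content-dependent gate; it is not the paper's exact parameterization, and the class name, bucket counts, and projections are assumptions made for this example.

```python
import torch
import torch.nn as nn

class GatedRelativePositionBias(nn.Module):
    """Simplified illustration: a bucketed relative position bias scaled by a
    gate computed from the hidden states. Not WavLM's exact parameterization."""

    def __init__(self, model_dim: int, num_heads: int,
                 num_buckets: int = 320, max_distance: int = 800):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.rel_embed = nn.Embedding(num_buckets, num_heads)  # one bias per bucket and head
        self.gate_proj = nn.Linear(model_dim, num_heads)       # content-dependent gate

    def _bucket(self, rel_pos: torch.Tensor) -> torch.Tensor:
        # Clip offsets i - j and map them linearly onto a fixed set of buckets.
        rel = rel_pos.clamp(-self.max_distance, self.max_distance)
        return (rel + self.max_distance) * (self.num_buckets - 1) // (2 * self.max_distance)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, model_dim) pre-attention hidden states.
        bsz, seq_len, _ = hidden.shape
        pos = torch.arange(seq_len, device=hidden.device)
        rel = pos[None, :] - pos[:, None]                       # (seq_len, seq_len) offsets
        bias = self.rel_embed(self._bucket(rel))                # (seq_len, seq_len, heads)
        gate = torch.sigmoid(self.gate_proj(hidden))            # (batch, seq_len, heads)
        # Scale each query position's bias row by its gate -> (batch, heads, query, key).
        return gate.permute(0, 2, 1)[..., None] * bias.permute(2, 0, 1)[None]

# Usage: add the returned tensor to the attention logits before the softmax.
bias_module = GatedRelativePositionBias(model_dim=768, num_heads=12)
logits_bias = bias_module(torch.randn(2, 50, 768))              # shape (2, 12, 50, 50)
```

The intuition is that a fixed relative position bias treats every query the same, whereas gating it on the content lets the model weight positional information differently depending on the local acoustic context.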
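For point 5, the canonical loading path is documented in the repository linked at https://aka.ms/wavlm. As a quick-start alternative, the sketch below assumes the Hugging Face `transformers` integration of WavLM and the `microsoft/wavlm-base-plus` checkpoint name.

```python
# Minimal feature-extraction sketch, assuming the Hugging Face `transformers`
# integration of WavLM and the `microsoft/wavlm-base-plus` checkpoint.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

waveform = torch.randn(16000)  # stand-in for 1 s of 16 kHz mono audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level representations to fine-tune for downstream tasks
# (ASR, speaker verification, diarization, separation, ...).
print(outputs.last_hidden_state.shape)  # (1, num_frames, hidden_size)
```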