Parallel WaveNet: Fast High-Fidelity Speech Synthesis

28 Nov 2017 | Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, Demis Hassabis
Parallel WaveNet is a fast, high-fidelity speech synthesis system that improves upon the original WaveNet architecture by enabling parallel generation. The original WaveNet, while highly effective for speech synthesis, generates samples sequentially and is therefore slow due to its autoregressive structure. To address this, the authors introduce Probability Density Distillation, a method for training a parallel feed-forward network from a trained WaveNet with no significant loss in quality. The resulting system generates speech more than 20 times faster than real time and is used by the Google Assistant for multiple English and Japanese voices.

WaveNet is a convolutional autoregressive model that predicts raw audio one sample at a time using stacks of dilated causal convolutions. This lets it model audio at high temporal resolution, but sample-by-sample generation makes synthesis slow. Inverse autoregressive flows (IAFs) have the opposite trade-off: they can generate all samples in parallel, but evaluating the likelihood of observed data requires sequential inference, which makes maximum-likelihood training slow.

The paper proposes a new neural-network distillation method, Probability Density Distillation, in which a trained WaveNet serves as a teacher for a parallel IAF student. This combines the efficient training of WaveNet with the efficient sampling of IAFs. The paper describes the original WaveNet model, the parallel (student) version, and the distillation process; the sketches below illustrate these components.
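To make the causal-convolution idea concrete, here is a minimal sketch of WaveNet's basic building block. PyTorch is our choice of framework here, not something the paper specifies, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """A 1-D convolution that sees only current and past samples."""

    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        # Left-padding of (kernel_size - 1) * dilation makes the convolution
        # causal: the output at time t depends only on inputs at times <= t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad on the left only.
        return self.conv(F.pad(x, (self.pad, 0)))

# Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field
# exponentially with depth, as in the original WaveNet.
layer = CausalDilatedConv1d(channels=16, dilation=4)
out = layer(torch.randn(1, 16, 1000))  # output has the same length as the input
```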
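The parallelism of IAF sampling can be seen directly in the transformation it applies. In the sketch below, `shift_net` and `scale_net` are hypothetical causal networks (e.g. stacks of the convolution above), not names from the paper:

```python
import torch
import torch.nn.functional as F

def iaf_step(z: torch.Tensor, shift_net, scale_net) -> torch.Tensor:
    """One IAF transformation: x_t = z_t * s_t(z_{<t}) + mu_t(z_{<t}).

    Because the shift and scale are conditioned on the noise z, which is
    fully known in advance, every output sample x_t can be computed in
    parallel. Inverting the flow (recovering z from observed x) is
    sequential, which is why IAFs sample quickly but evaluate likelihoods
    of external data slowly.
    """
    mu = shift_net(z)                             # (batch, time)
    s = F.softplus(scale_net(z)) + 1e-6           # positive scales
    return z * s + mu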
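The distillation objective itself can be summarised in a few lines. This is a hedged sketch: `student` and `teacher` are assumed model objects with illustrative method names, not the paper's code:

```python
import torch

def probability_density_distillation(student, teacher, z, cond):
    """Monte-Carlo sketch of the KL term of Probability Density Distillation.

    The student IAF maps noise z to audio x in parallel and reports its own
    log-density log q(x) in closed form. The teacher WaveNet, with its
    parameters frozen, scores log p(x) for all timesteps at once, since the
    complete waveform is available. Minimising E[log q(x) - log p(x)] is an
    estimate of KL(q || p), pulling the student toward the teacher.
    """
    x, log_q = student.sample_with_logprob(z, cond)  # differentiable sample
    log_p = teacher.log_prob(x, cond)  # gradients reach the student through x,
                                       # not through the frozen teacher weights
    return (log_q - log_p).mean()
```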
It also presents experimental results showing that the parallel model matches the quality of the original WaveNet and outperforms previous benchmarks, while achieving over a 1000× speed-up in sample generation. Because the distillation objective alone admits degenerate solutions (the student can, for example, collapse to whispering), additional loss terms are introduced to improve audio quality and fidelity: a power loss that matches the energy of generated speech to that of the training data, a perceptual loss, and a contrastive loss that penalises audio which scores well under the wrong conditioning. The student is therefore trained with a combination of the KL divergence and these power, perceptual, and contrastive losses. The system is evaluated on multiple languages and speakers, showing significant improvements over baselines, and is deployed in production at Google, providing real-time speech synthesis to millions of users. The paper concludes that the proposed method achieves high-fidelity speech synthesis with significant speed improvements and can be applied to other domains.
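As an illustration of the first of these auxiliary terms, here is a sketch of a power loss that compares short-time power spectra of generated and reference speech; the STFT parameters are assumptions, not the paper's settings:

```python
import torch

def power_loss(x_gen: torch.Tensor, x_ref: torch.Tensor,
               n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Penalise mismatched short-time power spectra (sketch).

    Matching average power per frequency band discourages degenerate
    low-energy output such as whispering. Inputs are (batch, time) waveforms.
    """
    window = torch.hann_window(n_fft)

    def avg_power(x: torch.Tensor) -> torch.Tensor:
        spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                          return_complex=True)
        return (spec.abs() ** 2).mean(dim=-1)  # average power over frames

    return torch.mean((avg_power(x_gen) - avg_power(x_ref)) ** 2)
```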