WAVEGLOW: A FLOW-BASED GENERATIVE NETWORK FOR SPEECH SYNTHESIS

31 Oct 2018 | Ryan Prenger, Rafael Valle, Bryan Catanzaro
WaveGlow is a flow-based generative network that synthesizes high-quality speech from mel-spectrograms. It combines insights from Glow and WaveNet to provide fast, efficient, high-quality audio synthesis without auto-regression. WaveGlow uses a single network trained with a single cost function, maximizing the likelihood of the training data, which makes training simple and stable. The PyTorch implementation generates audio at a rate of more than 500 kHz (over 500,000 samples per second) on an NVIDIA V100 GPU, significantly faster than real time. Mean Opinion Scores (MOS) show that WaveGlow delivers audio quality comparable to the best publicly available WaveNet implementation. The paper also discusses the challenges and limitations of existing speech synthesis models, highlighting WaveGlow's advantages in training simplicity and inference speed.
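To make the idea of training a flow by maximizing likelihood concrete, here is a minimal sketch of a single affine coupling step, the kind of invertible building block used in Glow-style flows. It is not the authors' PyTorch implementation: the function names, the tiny linear maps `w_s` and `w_t` standing in for a learned network, and the 8-sample input are all assumptions made for illustration, and the mel-spectrogram conditioning is left out entirely.

```python
import numpy as np

def coupling_forward(x, w_s, w_t):
    """Map data x to latent z with one affine coupling step.

    Returns z and log|det J| of the transform.
    """
    x_a, x_b = np.split(x, 2)        # x_a passes through unchanged
    log_s = np.tanh(w_s @ x_a)       # hypothetical stand-in for a learned network
    t = w_t @ x_a
    z_b = np.exp(log_s) * x_b + t    # elementwise affine transform of x_b
    return np.concatenate([x_a, z_b]), log_s.sum()

def coupling_inverse(z, w_s, w_t):
    """Invert the coupling step: recover x from z without any iteration."""
    z_a, z_b = np.split(z, 2)
    log_s = np.tanh(w_s @ z_a)       # same scale and shift, computed from z_a = x_a
    t = w_t @ z_a
    x_b = (z_b - t) * np.exp(-log_s)
    return np.concatenate([z_a, x_b])

def log_likelihood(x, w_s, w_t):
    """Exact log p(x) = log N(z; 0, I) + log|det J| (change of variables)."""
    z, log_det = coupling_forward(x, w_s, w_t)
    log_pz = -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2 * np.pi)
    return log_pz + log_det

# Toy parameters and a toy "audio" vector (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
w_s = rng.normal(size=(4, 4)) * 0.1
w_t = rng.normal(size=(4, 4)) * 0.1
x = rng.normal(size=8)

z, _ = coupling_forward(x, w_s, w_t)
x_rec = coupling_inverse(z, w_s, w_t)
ll = log_likelihood(x, w_s, w_t)
```

In this sketch, training would mean adjusting `w_s` and `w_t` to increase `log_likelihood` on real data, which is the single cost function the summary refers to. Synthesis would run `coupling_inverse` on Gaussian noise, so every output sample is computed in one pass rather than one at a time.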