31 Oct 2018 | Ryan Prenger, Rafael Valle, Bryan Catanzaro
WaveGlow is a flow-based generative network for speech synthesis that produces high-quality audio from mel-spectrograms. It combines ideas from Glow and WaveNet to achieve fast, efficient, high-quality audio synthesis without auto-regression. WaveGlow is simple to implement and train: a single network optimized with a single cost function, maximizing the likelihood of the training data. The PyTorch implementation generates audio at over 500 kHz on an NVIDIA V100 GPU, with quality comparable to the best publicly available WaveNet implementation.
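The "single cost function" is the change-of-variables log-likelihood that all normalizing flows share: map audio x through an invertible transform to a latent z, score z under a simple prior, and add the log-determinant of the transform's Jacobian. A minimal NumPy sketch (not the paper's PyTorch code; the orthonormal weight matrix standing in for an invertible 1x1 convolution is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy invertible transform: one weight matrix W applied across channels,
# analogous to a 1x1 convolution (same W at every time step).
C, T = 4, 16                                       # channels, time steps
W = np.linalg.qr(rng.standard_normal((C, C)))[0]   # orthonormal => invertible
x = rng.standard_normal((C, T))                    # stand-in for an audio slice

z = W @ x                                          # forward pass to latent z
x_rec = np.linalg.inv(W) @ z                       # exact inverse for synthesis
assert np.allclose(x, x_rec)

# Change-of-variables log-likelihood under a standard normal prior:
#   log p(x) = log N(z; 0, I) + T * log|det W|
log_prior = -0.5 * (z ** 2).sum() - 0.5 * z.size * np.log(2 * np.pi)
log_det = T * np.linalg.slogdet(W)[1]
log_px = log_prior + log_det
```

Training maximizes `log_px` over the dataset; inference runs the inverse direction, sampling z from the prior and mapping it back to audio.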
WaveGlow uses affine coupling layers and invertible 1x1 convolutions to condition the generated audio on the input mel-spectrogram. It also emits early outputs from intermediate flow steps, which helps gradients propagate to earlier layers. Trained on the LJ Speech dataset, the model achieves high mean opinion scores (MOS) in audio-quality tests. Inference is fast as well, reaching speeds of up to 2,000 kHz on an NVIDIA GV100 GPU; by avoiding auto-regression, WaveGlow is more efficient than auto-regressive models and has a simpler training process. In short, it enables efficient speech synthesis with a single model that is easy to train.
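The affine coupling layer is what keeps the flow invertible: half the channels pass through unchanged and parameterize a scale and bias applied to the other half, so the inverse can be computed in closed form. A toy NumPy sketch, assuming a tanh of a linear map as a stand-in for the paper's WaveNet-like conditioning network (`coupling_forward` and the weights are illustrative, not WaveGlow's actual layer):

```python
import numpy as np

rng = np.random.default_rng(0)

def coupling_forward(x, w1, w2):
    # Split channels: x_a passes through untouched and parameterizes the
    # affine transform (log-scale and bias) applied to x_b.
    x_a, x_b = np.split(x, 2)
    log_s = np.tanh(x_a @ w1)          # toy conditioning net; any fn of x_a works
    t = x_a @ w2
    y_b = np.exp(log_s) * x_b + t
    return np.concatenate([x_a, y_b]), log_s.sum()   # log-det is just sum(log_s)

def coupling_inverse(y, w1, w2):
    # Because y_a == x_a, log_s and t can be recomputed exactly,
    # giving a closed-form inverse with no iterative solve.
    y_a, y_b = np.split(y, 2)
    log_s = np.tanh(y_a @ w1)
    t = y_a @ w2
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b])

w1 = rng.standard_normal((4, 4))
w2 = rng.standard_normal((4, 4))
x = rng.standard_normal(8)

y, log_det = coupling_forward(x, w1, w2)
x_rec = coupling_inverse(y, w1, w2)
assert np.allclose(x, x_rec)           # round-trip is exact up to float error
```

In WaveGlow the conditioning network also receives the upsampled mel-spectrogram, which is how the generated audio is tied to the input; since that network is only ever evaluated in the forward direction, it can be arbitrarily complex without breaking invertibility.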