25 Jun 2018 | Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, Koray Kavukcuoglu
The paper presents efficient sampling techniques for sequential models, focusing on text-to-speech synthesis. It introduces WaveRNN, a single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model while significantly reducing computational cost. WaveRNN can generate 24 kHz 16-bit audio at 4 times real-time speed on a GPU. The authors also apply weight pruning to reduce the number of weights, showing that large sparse networks outperform small dense networks, even at high sparsity levels. Additionally, they propose a subscaling technique that generates multiple samples at once, increasing sampling efficiency without compromising quality: the Subscale WaveRNN produces 16 samples per step with no loss of quality and can run in real time on mobile CPUs.
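The dual softmax works by splitting each 16-bit sample into a coarse (high) and a fine (low) 8-bit part, so the model predicts two 256-way distributions instead of a single 65,536-way one. A minimal sketch of that split (function names are illustrative, not from the paper):

```python
def split_sample(sample_16bit):
    """Split a signed 16-bit sample into coarse/fine 8-bit targets."""
    u = sample_16bit + 2**15   # map signed [-32768, 32767] to unsigned [0, 65535]
    coarse = u // 256          # high 8 bits: target of the first softmax
    fine = u % 256             # low 8 bits: target of the second softmax
    return coarse, fine

def join_sample(coarse, fine):
    """Inverse mapping: recombine the two 8-bit parts into a signed sample."""
    return coarse * 256 + fine - 2**15
```

In the model, the fine softmax is conditioned on the sampled coarse value, which is what lets two cheap 256-way predictions stand in for one expensive 65,536-way prediction.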
The paper evaluates these methods on a North American English text-to-speech dataset, demonstrating their effectiveness in terms of quality and speed.
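The pruning result above can be illustrated with a one-shot magnitude mask. Note this is a simplified sketch: the paper prunes gradually during training rather than in a single step, and the function name is hypothetical.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([0.1, -0.5, 0.01, 2.0])
sparse_w = prune_by_magnitude(w, 0.5)  # keeps only -0.5 and 2.0
```

At a fixed parameter budget, the paper's finding is that spending that budget on a larger layer pruned to high sparsity beats a small dense layer of the same size.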