25 Jun 2018 | Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, Koray Kavukcuoglu
Efficient Neural Audio Synthesis presents a novel approach to generating high-quality audio samples efficiently. The paper introduces WaveRNN, a single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The dual softmax splits the prediction of each 16-bit raw audio sample into two 8-bit halves (coarse and fine), so the network outputs two 256-way distributions instead of a single 65,536-way one, achieving high audio fidelity at a fraction of WaveNet's sampling time. The model is further optimized for real-time audio synthesis on GPUs and mobile CPUs through weight pruning, structured sparsity, and subscaling.
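The coarse/fine split behind the dual softmax can be sketched in a few lines. This is an illustrative helper, not code from the paper (the function names `split_sample` and `combine` are mine): each 16-bit sample is divided into its high and low bytes, which the model predicts with two 256-way softmaxes instead of one 65,536-way softmax.

```python
def split_sample(sample_16bit):
    """Split an unsigned 16-bit sample into its coarse (high) and
    fine (low) 8-bit halves, each a target for a 256-way softmax."""
    u = sample_16bit & 0xFFFF
    coarse = u >> 8       # high byte: coarse amplitude
    fine = u & 0xFF       # low byte: fine residual
    return coarse, fine

def combine(coarse, fine):
    """Reassemble the two 8-bit halves into the 16-bit sample."""
    return (coarse << 8) | fine
```

At sampling time the coarse byte is drawn first and fed back into the network before the fine byte is drawn, which is why two small softmaxes can stand in for one huge one.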
WaveRNN reduces the number of operations required per sample, enabling faster sampling. On GPU, the model is implemented as a single persistent kernel that keeps the weights in on-chip memory across timesteps, which significantly improves performance by removing kernel-launch overhead and memory-bandwidth bottlenecks. On mobile CPUs, where sampling time is dominated by loading parameters from memory, weight pruning reduces the parameter count directly. The resulting Sparse WaveRNN, which uses structured (block) sparsity so the surviving weights can be stored and multiplied efficiently, achieves high performance with few parameters and low memory-bandwidth requirements.
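A minimal sketch of the structured pruning idea: weights are zeroed in small square blocks by magnitude, so the sparse matrix keeps a regular layout that is cheap to store and multiply. This is a one-shot illustration under my own assumptions (the paper prunes gradually during training with a sparsity schedule, and the block size and function name here are illustrative).

```python
import numpy as np

def block_prune(weights, sparsity, block=4):
    """Zero the lowest-magnitude (block x block) blocks of a weight
    matrix, keeping a `1 - sparsity` fraction of blocks. One-shot
    sketch of structured magnitude pruning."""
    h, w = weights.shape
    assert h % block == 0 and w % block == 0
    # Score each block by its mean absolute weight magnitude.
    blocks = weights.reshape(h // block, block, w // block, block)
    scores = np.abs(blocks).mean(axis=(1, 3))
    k = int(scores.size * sparsity)  # number of blocks to drop
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest block score.
    thresh = np.partition(scores.ravel(), k - 1)[k - 1]
    mask = (scores > thresh).astype(weights.dtype)
    # Expand the block mask back to element resolution.
    mask = np.repeat(np.repeat(mask, block, axis=0), block, axis=1)
    return weights * mask
```

Because entire blocks are zeroed, a sparse matrix-vector product only needs to touch the surviving blocks, which is what makes the pruned model fast on a mobile CPU rather than merely small.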
The paper also introduces the Subscale WaveRNN, which generates multiple samples per step by folding the long output sequence into a batch of shorter, interleaved sub-sequences. Each sub-sequence conditions on its own past and on only a small future horizon of the sub-sequences generated before it, so once the horizon is cleared the sub-sequences can be advanced in parallel, one sample each per step. The Subscale WaveRNN achieves high audio fidelity while significantly improving sampling efficiency.
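The folding operation itself is a simple reshape. The sketch below (helper names `fold`/`unfold` are mine) interleaves a sequence of length L into B sub-sequences, where sub-sequence s holds samples s, s+B, s+2B, and so on; the conditioning scheme that lets the B rows run in parallel sits on top of this layout.

```python
import numpy as np

def fold(x, B):
    """Fold a 1-D sequence of length L (a multiple of B) into B
    interleaved sub-sequences of length L // B."""
    return x.reshape(-1, B).T   # row s holds samples s, s+B, s+2B, ...

def unfold(sub):
    """Invert fold: re-interleave the sub-sequences into one sequence."""
    return sub.T.reshape(-1)
```

With B sub-sequences running in parallel, the model produces B samples per recurrent step instead of one, which is the source of the speedup.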
The paper evaluates the performance of the models on a text-to-speech synthesis task, demonstrating that the proposed methods significantly improve sampling speed without compromising audio quality. The results show that the Sparse WaveRNN and Subscale WaveRNN achieve high performance on both GPU and mobile CPU platforms, making them suitable for real-time audio synthesis. The methods presented have broad implications for efficient neural audio synthesis and can be applied to various domains beyond audio.