WAVENET: A GENERATIVE MODEL FOR RAW AUDIO

19 Sep 2016 | Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive: the predictive distribution for each audio sample is conditioned on all previous samples. Even so, WaveNet can be trained efficiently on audio with tens of thousands of samples per second. Applied to text-to-speech (TTS), it achieves state-of-the-art performance, with human listeners rating it as significantly more natural than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of multiple speakers and switch between them by conditioning on speaker identity. When trained to model music, it generates novel and highly realistic musical fragments. It also shows promising results in speech recognition, achieving the best score on the TIMIT dataset among models trained directly on raw audio. WaveNet provides a flexible framework for a range of audio generation applications, including TTS, music, and speech enhancement.
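The autoregressive formulation described above means the joint probability of a waveform x = (x_1, ..., x_T) factorizes as a product of conditionals, p(x) = p(x_1) p(x_2 | x_1) ... p(x_T | x_1, ..., x_{T-1}), and generation proceeds one sample at a time. The following is a minimal sketch of that sampling loop only, not of the WaveNet architecture: `toy_conditional` is a hypothetical stand-in for the network, `NUM_LEVELS` is an arbitrary illustrative number of quantized amplitude levels, and the way `speaker_id` enters the model is an assumption made purely to illustrate conditioning on speaker identity.

```python
import math
import random

NUM_LEVELS = 16  # illustrative number of quantized amplitude levels (assumption)


def toy_conditional(history, speaker_id=0):
    """Hypothetical stand-in for the network: p(x_t | x_1..x_{t-1}, speaker)."""
    # This toy only looks at the most recent sample; a real model would use
    # a much longer history. The speaker id shifts the starting level to
    # mimic conditioning on speaker identity.
    if history:
        prev = history[-1]
    else:
        prev = (NUM_LEVELS // 2 + speaker_id) % NUM_LEVELS
    # Favour levels close to the previous sample so the signal varies smoothly.
    logits = [-abs(level - prev) for level in range(NUM_LEVELS)]
    # Numerically stable softmax over the quantized levels.
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, speaker_id=0, seed=0):
    """Draw samples one at a time, each conditioned on those generated so far."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_conditional(samples, speaker_id)
        next_sample = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        samples.append(next_sample)
    return samples


def log_likelihood(samples, speaker_id=0):
    """Sum of log p(x_t | x_<t), i.e. the log of the autoregressive product."""
    total = 0.0
    for t, x_t in enumerate(samples):
        probs = toy_conditional(samples[:t], speaker_id)
        total += math.log(probs[x_t])
    return total


if __name__ == "__main__":
    waveform = generate(20, speaker_id=1, seed=3)
    print(waveform)
    print(log_likelihood(waveform, speaker_id=1))
```

During generation each sample must wait for the previous one, so sampling is inherently sequential. During training the whole waveform is already known, so every conditional in `log_likelihood` could in principle be evaluated independently of the others.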