WAVENET: A GENERATIVE MODEL FOR RAW AUDIO

19 Sep 2016 | Aäron van den Oord, Sander Dieleman, Heiga Zen†, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu
WaveNet is a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive: the predictive distribution for each audio sample is conditioned on all previous samples. It can be trained efficiently on data with tens of thousands of samples per second of audio. Applied to text-to-speech (TTS), it yields state-of-the-art performance, with human listeners rating its output as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity and can switch between them by conditioning on speaker identity. Trained on music, it generates novel and often highly realistic musical fragments, and it can also be employed as a discriminative model, with promising results on phoneme recognition.

The architecture is based on PixelCNN and uses dilated causal convolutions to increase the receptive field without greatly increasing computational cost. The conditional distribution of each audio sample is modeled with a softmax over quantized amplitude values, which is flexible enough to represent arbitrary distributions. The network also uses gated activation units and residual and skip connections. Conditional WaveNets model the distribution of audio given additional inputs: conditioning can be global (e.g., speaker identity) or local (e.g., linguistic features), guiding generation toward audio with the required characteristics.

WaveNet was evaluated on three tasks: multi-speaker speech generation, TTS, and music audio modeling. In multi-speaker speech generation, a single model captured speech from any of the speakers by conditioning on a one-hot encoding of speaker identity. In TTS, it synthesized speech with natural segmental quality but sometimes produced unnatural prosody, stressing the wrong words in a sentence.
This may be due to the long-term dependencies of F0 contours; a WaveNet conditioned on both linguistic features and F0 values did not exhibit the problem. In music audio modeling, the network generated samples that sounded harmonic and aesthetically pleasing. WaveNet was also adapted to discriminative audio tasks such as speech recognition, achieving strong performance on the TIMIT dataset. Overall, WaveNet provides a generic and flexible framework for applications that rely on audio generation, including TTS, music, speech enhancement, voice conversion, and source separation, and its promising results across these tasks suggest potential for a wide range of further applications.
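The dilated causal convolutions described above are the key to WaveNet's long context. The sketch below is a minimal pure-Python illustration (not the paper's implementation): `dilated_causal_conv` applies a width-2 filter with left padding so each output depends only on current and past inputs, and `receptive_field` computes how far back a stack of such layers can see. With the dilation pattern 1, 2, 4, ..., 512 used in one WaveNet stack, a width-2 filter gives a receptive field of 1024 samples, as reported in the paper.

```python
def dilated_causal_conv(x, w, dilation):
    """Width-2 causal convolution: each output depends only on the
    current input and one input `dilation` steps in the past."""
    # Left-pad with zeros so the output has the same length as the
    # input and never looks at future samples (causality).
    padded = [0.0] * dilation + list(x)
    return [w[0] * padded[t] + w[1] * padded[t + dilation]
            for t in range(len(x))]

def receptive_field(dilations, filter_width=2):
    """Number of input samples that influence one output sample."""
    return (filter_width - 1) * sum(dilations) + 1

# One WaveNet stack doubles the dilation at each layer: 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
print(receptive_field(dilations))  # 1024

# Stacking the convolutions widens the context exponentially with depth,
# while the number of layers (and parameters) grows only linearly.
signal = [float(t % 7) for t in range(32)]
h = signal
for d in dilations[:3]:  # dilations 1, 2, 4, for illustration
    h = dilated_causal_conv(h, (0.5, 0.5), d)
```

This is why dilation matters: ten ordinary width-2 convolutions would see only 11 samples back, while ten dilated layers see 1024 at the same cost per layer.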
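The softmax output mentioned above is made tractable by quantizing each sample: the paper applies a µ-law companding transformation and quantizes to 256 values, so the network predicts one of 256 classes per timestep instead of a raw 16-bit amplitude. A minimal sketch of that encode/decode step (using the standard µ-law formula, not the paper's code):

```python
import math

MU = 255  # mu-law parameter, giving 256 quantization levels

def mu_law_encode(x, mu=MU):
    """Map an amplitude in [-1, 1] to an integer class in [0, mu]."""
    # Companding compresses large amplitudes, keeping resolution near zero,
    # where speech amplitudes are concentrated.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Map [-1, 1] onto the integer classes {0, 1, ..., mu}.
    return int((compressed + 1.0) / 2.0 * mu + 0.5)

def mu_law_decode(k, mu=MU):
    """Invert the quantization: class index back to an amplitude."""
    compressed = 2.0 * k / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu,
                         compressed)

sample = 0.3
k = mu_law_encode(sample)       # an integer in [0, 255]
approx = mu_law_decode(k)       # close to 0.3 despite using only 8 bits
```

Predicting a 256-way categorical distribution is what lets the softmax represent arbitrary (even multimodal) distributions over the next sample, rather than committing to a Gaussian or mixture shape.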