DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS

30 Mar 2021 | Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro
DiffWave is a versatile diffusion probabilistic model designed for conditional and unconditional waveform generation. Unlike autoregressive models, DiffWave is non-autoregressive and converts white noise into structured waveforms through a Markov chain with a constant number of steps. It is trained efficiently by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in various tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. It matches the quality of strong WaveNet vocoders (MOS: 4.44 vs. 4.43) while synthesizing orders of magnitude faster. DiffWave outperforms autoregressive and GAN-based waveform models in unconditional generation tasks, as measured by both automatic and human evaluations.
The model's architecture is based on a bidirectional dilated convolution, which allows for parallel waveform synthesis and flexible conditioning. DiffWave is also smaller than other neural vocoders, making it suitable for real-time applications.
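The Markov chain described above can be sketched as a DDPM-style reverse process: starting from white noise, the model denoises the signal over a fixed number of steps T. The sketch below is a minimal illustration, not the paper's implementation; the noise schedule values, step count, and the `predict_noise` placeholder (standing in for DiffWave's trained dilated-convolution network) are all assumptions for demonstration.

```python
import numpy as np

def predict_noise(x_t, t):
    # Hypothetical stand-in for DiffWave's network, which predicts
    # the noise component present in x_t at diffusion step t.
    return np.zeros_like(x_t)

def reverse_diffusion(length=16000, T=50, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.05, T)       # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(length)          # start from white noise
    for t in reversed(range(T)):             # constant number of steps
        eps = predict_noise(x, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                            # no noise injected at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(length)
    return x

audio = reverse_diffusion()
print(audio.shape)  # (16000,)
```

Because every step denoises the entire waveform at once, synthesis is parallel across samples, in contrast to the sample-by-sample generation of autoregressive vocoders.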