DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS

30 Mar 2021 | Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro
DiffWave is a versatile diffusion probabilistic model for audio synthesis. It generates high-fidelity audio across a range of waveform generation tasks, including neural vocoding (conditioned on mel spectrograms), class-conditional generation, and unconditional generation. The model is non-autoregressive: a Markov chain with a constant number of diffusion steps converts white noise into a structured waveform, so synthesis does not require one network pass per audio sample.

DiffWave is trained by optimizing a variational bound (ELBO) on the data likelihood, a single training objective with no auxiliary losses, which keeps training simple and the model easy to adapt across tasks. As a neural vocoder, it matches a strong WaveNet baseline in speech quality (MOS: 4.44 vs. 4.43) while synthesizing significantly faster.

Architecturally, DiffWave replaces WaveNet's causal convolutions with bidirectional dilated convolutions, giving it a large receptive field with a small model footprint. In the challenging unconditional and class-conditional settings, it outperforms autoregressive and GAN-based waveform models in both audio quality and sample diversity, under automatic metrics as well as human evaluation, and it remains efficient in training and inference, with a smaller memory footprint and faster synthesis speed than competing state-of-the-art models.
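To make the training and sampling procedures above concrete, here is a minimal PyTorch sketch assuming a standard DDPM-style formulation: `eps_model` is a placeholder for the DiffWave network, `cond` stands for the conditioner (a mel spectrogram, a class label, or nothing in the unconditional case), and the schedule values (T = 50 steps, linear betas) are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import torch

# Assumed linear noise schedule with a small, constant number of steps.
T = 50
beta = torch.linspace(1e-4, 0.05, T)          # beta_t
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)       # cumulative product \bar{alpha}_t

def training_loss(eps_model, x0, cond):
    """Simplified ELBO objective: the network predicts the Gaussian
    noise that was mixed into the clean waveform at a random step t."""
    b = x0.shape[0]                            # x0: (batch, num_samples)
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    a_bar = alpha_bar[t].view(b, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return ((eps - eps_model(x_t, t, cond)) ** 2).mean()

@torch.no_grad()
def sample(eps_model, shape, cond):
    """Reverse Markov chain: start from white noise and denoise for a
    fixed number of steps, independent of the waveform length."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, torch.full((shape[0],), t), cond)
        coef = beta[t] / (1 - alpha_bar[t]).sqrt()
        x = (x - coef * eps) / alpha[t].sqrt() + beta[t].sqrt() * z
    return x
```

The property the summary highlights is visible in `sample`: the loop runs a fixed T iterations regardless of audio length, in contrast to an autoregressive vocoder, which needs one network evaluation per output sample.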
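The bidirectional dilated convolution can likewise be sketched as a WaveNet-style residual layer that uses symmetric ("same") padding instead of causal padding, so each output position sees both past and future samples. Channel sizes are illustrative assumptions, and the diffusion-step embedding and mel conditioning are omitted for brevity.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of one DiffWave-style residual layer: a non-causal
    (bidirectional) dilated convolution followed by a gated activation,
    with separate residual and skip outputs."""

    def __init__(self, channels=64, dilation=1):
        super().__init__()
        # padding = dilation keeps the length fixed and makes the
        # kernel-3 convolution non-causal (looks forward and backward).
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                                 dilation=dilation, padding=dilation)
        self.out = nn.Conv1d(channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        h = self.dilated(x)
        gate, filt = h.chunk(2, dim=1)
        h = torch.sigmoid(gate) * torch.tanh(filt)   # gated tanh unit
        h = self.out(h)
        residual, skip = h.chunk(2, dim=1)
        return (x + residual) / (2 ** 0.5), skip     # residual + skip paths
```

Stacking such blocks with exponentially growing dilations (1, 2, 4, ...) roughly doubles the receptive field per layer, which is how the model covers long waveforms while keeping the parameter count small.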