22 Jun 2022 | Jonathan Ho*, Tim Salimans*, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet
The paper introduces a diffusion model for video generation, extending the standard image diffusion architecture to handle video data. The model is trained jointly on image and video data, which reduces gradient variance and speeds up optimization. For generating longer and higher-resolution videos, the authors introduce reconstruction-guided sampling, a new conditional sampling technique that improves the quality of generated videos and outperforms previously used methods. The model achieves state-of-the-art results on video prediction and unconditional video generation benchmarks, as well as promising initial results on text-conditioned video generation. The paper discusses the benefits of joint image-video training and classifier-free guidance, and highlights the potential societal implications of the model, emphasizing the need for bias auditing and addressing ethical concerns.
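To make the reconstruction-guided sampling idea concrete, here is a minimal sketch of how the conditioning might look in code. It assumes a denoising network `x_hat(z_t, t)` that predicts the clean video from the noisy joint latent over all frames; the frame indexing, the guidance weight `w_r`, and the schedule value `alpha_t` are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of reconstruction-guided conditional sampling (assumed API).
import jax
import jax.numpy as jnp

def reconstruction_guided_x_hat(x_hat, z_t, t, x_a, cond_idx, w_r, alpha_t):
    """Adjust the model's prediction of the unknown frames so that its
    reconstruction of the known (conditioning) frames x_a stays consistent."""

    def recon_loss(z_t_full):
        pred = x_hat(z_t_full, t)                  # predict clean video from noisy latent
        pred_a = pred[cond_idx]                    # model's reconstruction of conditioning frames
        return 0.5 * jnp.sum((x_a - pred_a) ** 2)  # squared error against observed frames

    # Gradient of the reconstruction error with respect to the noisy latent.
    grad = jax.grad(recon_loss)(z_t)

    # Guided prediction: subtract the weighted gradient from the model output.
    pred = x_hat(z_t, t)
    guided = pred - w_r * alpha_t * grad

    # Pin the conditioning frames to their observed values.
    return guided.at[cond_idx].set(x_a)
```

In use, this guided prediction would replace the unguided one inside each step of the standard ancestral sampler, which is how the paper extends generation to longer videos by conditioning each new block of frames on previously generated ones.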