22 Jun 2022 | Jonathan Ho*, Tim Salimans*, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet
This paper introduces a video diffusion model that generates high-quality, temporally coherent videos. The model extends the standard image diffusion architecture to video, enabling joint training on image and video data, which reduces gradient variance and speeds up optimization. A new conditional sampling technique is introduced for extending videos in space and time, and it outperforms previously proposed methods. The model achieves state-of-the-art results on video prediction and unconditional video generation benchmarks, and the paper presents the first results on a large text-conditioned video generation task.
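To make the joint image-and-video training concrete, the following minimal sketch treats images as single-frame videos so that one ε-prediction diffusion loss covers both kinds of batches. The `model(z_t, t)` interface, the `(B, T, C, H, W)` layout, and the `alphas_cumprod` schedule are illustrative assumptions, not the paper's actual code.

```python
import torch

def diffusion_loss(model, x0, alphas_cumprod):
    """One denoising-diffusion training step with an epsilon-prediction loss.

    x0: clean data of shape (B, T, C, H, W); image batches are passed in as
    T == 1 "videos" so the same network and loss cover both modalities
    (an illustrative choice, not the paper's exact mechanism).
    """
    B = x0.shape[0]
    # Sample a random timestep per example and look up the noise schedule.
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    # Forward process: z_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise.
    z_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    # The network predicts the added noise; train with MSE against it.
    return torch.mean((model(z_t, t) - noise) ** 2)
```

Because image and video batches share the same parameters and the same loss, their gradients can be averaged within a single optimizer step.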
The model uses a 3D U-Net architecture factorized over space and time, which allows video data to be processed efficiently. It is trained to jointly model a fixed number of frames at a fixed spatial resolution. To generate longer and higher-resolution videos, a new reconstruction-guided sampling method is introduced, which improves conditional generation by incorporating information from previously generated frames.
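The effect of reconstruction-guided sampling can be pictured as a gradient correction: the denoised estimate of the frames being generated is nudged toward agreement with the frames that are already known. The sketch below assumes a model that predicts the clean frames from `z_t`, a frame axis at dimension 1, and an illustrative guidance weight `w_r`; it shows the idea, not the paper's exact update rule.

```python
import torch

def reconstruction_guided_estimate(model, z_t, t, x_a, alpha_bar_t, w_r=1.0):
    """Adjust the denoised estimate of new frames using known frames x_a.

    z_t stacks the noisy known frames and the noisy frames being generated
    along dimension 1; model(z_t, t) returns a denoised estimate of all
    frames. Names, shapes, and the exact scaling are illustrative.
    """
    z_t = z_t.detach().requires_grad_(True)
    x_hat = model(z_t, t)              # denoised estimate of every frame
    n_a = x_a.shape[1]                 # number of known (conditioning) frames
    # Reconstruction error measured only on the known frames.
    err = ((x_a - x_hat[:, :n_a]) ** 2).sum()
    grad = torch.autograd.grad(err, z_t)[0]
    # Push the generated frames toward consistency with the known frames.
    return x_hat[:, n_a:] - (w_r * alpha_bar_t / 2.0) * grad[:, n_a:]
```

The corrected estimate would then replace the raw denoised estimate inside an otherwise standard sampler.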
The model is evaluated on several benchmarks covering unconditional video generation, video prediction, and text-conditioned video generation. Results show that it produces high-quality video samples and improves on previous methods. The experiments also demonstrate the benefits of jointly training on video and image modeling objectives, and the effectiveness of classifier-free guidance in improving sample quality.
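Classifier-free guidance extrapolates between the conditional and unconditional predictions of the denoising network. A minimal sketch, assuming an ε-predicting `model(z_t, t, emb)` interface and a null embedding that stands in for dropped conditioning (both assumptions, not the paper's actual code):

```python
def classifier_free_guidance(model, z_t, t, cond_emb, null_emb, w=2.0):
    """Guided noise prediction: (1 + w) * conditional - w * unconditional.

    cond_emb is the conditioning (e.g. a text embedding) and null_emb is the
    embedding used when conditioning is dropped; w is the guidance weight.
    All names and the interface are illustrative.
    """
    eps_cond = model(z_t, t, cond_emb)      # prediction with conditioning
    eps_uncond = model(z_t, t, null_emb)    # prediction without conditioning
    return (1.0 + w) * eps_cond - w * eps_uncond
```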
The paper also discusses the potential societal implications of the model, noting that while it has the potential to positively impact creative applications, it could also be misused for harmful purposes such as generating fake content. The authors therefore decide not to release their models, emphasizing the need for careful curation to ensure fair and ethical use.