25 Apr 2024 | Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks
This paper introduces TI2V-Zero, a zero-shot text-conditioned image-to-video (TI2V) generation method that synthesizes video from a single image and a text prompt without any optimization, fine-tuning, or external modules. The method steers a pretrained text-to-video (T2V) diffusion model to condition on the provided image, so the generated video both starts from that image and follows the given text description. A "repeat-and-slide" strategy modulates the reverse denoising process, enabling frame-by-frame synthesis starting from the provided image. To ensure temporal continuity, a DDPM inversion strategy initializes the Gaussian noise for each new frame, and a resampling technique preserves visual details. Comprehensive experiments on domain-specific and open-domain datasets, including MUG, UCF101, and a new open-domain dataset, show that TI2V-Zero consistently outperforms a recent open-domain TI2V model, producing temporally coherent and visually convincing videos. When provided with additional images, the method extends to related tasks such as video infilling and prediction, and its autoregressive design supports long video generation without additional training. Overall, TI2V-Zero achieves promising performance in text-conditioned image-to-video generation, with significant improvements in visual quality, temporal coherence, and sample diversity compared to existing methods.
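To make the repeat-and-slide idea concrete, below is a minimal PyTorch sketch of how such an autoregressive generation loop could look: the conditioning window is first filled by repeating the given image, each new frame is initialized by forward-diffusing (DDPM-inverting) the previous frame, and the window then slides forward by one frame at a time. The `DummyDenoiser`, the noise schedule, the window size, and the helper names (`ddpm_add_noise`, `ddpm_step`, `generate_video`) are illustrative assumptions standing in for the pretrained T2V model and the paper's actual implementation; the resampling refinement described above is omitted for brevity.

```python
import torch

# --- Placeholder components (assumptions, not the paper's released code) ----

class DummyDenoiser(torch.nn.Module):
    """Stand-in for a pretrained T2V diffusion model.

    Takes a noisy frame window (B, F, C, H, W), an integer timestep, and a
    text embedding, and predicts the noise for every frame in the window.
    """
    def forward(self, x, t, text_emb):
        return torch.zeros_like(x)  # a real model would predict noise here


def ddpm_add_noise(x0, t, alphas_cumprod):
    """Forward-diffuse clean frames to timestep t (standard DDPM q(x_t | x_0))."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)


def ddpm_step(model, x, t, text_emb, alphas, alphas_cumprod):
    """One reverse DDPM step x_t -> x_{t-1} using the model's noise estimate."""
    eps = model(x, t, text_emb)
    a_t, a_bar_t = alphas[t], alphas_cumprod[t]
    mean = (x - (1.0 - a_t) / (1.0 - a_bar_t).sqrt() * eps) / a_t.sqrt()
    if t > 0:
        mean = mean + (1.0 - a_t).sqrt() * torch.randn_like(x)
    return mean

# --- Repeat-and-slide generation loop (illustrative sketch) -----------------

def generate_video(model, first_frame, text_emb, num_new_frames=8,
                   window=8, T=50):
    betas = torch.linspace(1e-4, 2e-2, T)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    # "Repeat": fill the conditioning window by repeating the provided image.
    frames = [first_frame.clone() for _ in range(window)]

    for _ in range(num_new_frames):
        # "Slide": condition on the most recent (window - 1) known frames.
        ctx = torch.stack(frames[-(window - 1):], dim=1)

        # DDPM-inversion-style init: forward-diffuse the last known frame to
        # the final timestep so the new frame starts near its predecessor.
        x_new = ddpm_add_noise(frames[-1], T - 1, alphas_cumprod).unsqueeze(1)

        for t in reversed(range(T)):
            # Re-noise the known frames to the current timestep so the whole
            # window is denoised jointly; only the last slot is truly unknown.
            noisy_ctx = ddpm_add_noise(ctx, t, alphas_cumprod) if t > 0 else ctx
            x = torch.cat([noisy_ctx, x_new], dim=1)
            x = ddpm_step(model, x, t, text_emb, alphas, alphas_cumprod)
            x_new = x[:, -1:]

        frames.append(x_new.squeeze(1))  # the window slides forward by one

    return torch.stack(frames, dim=1)    # (B, window + num_new_frames, C, H, W)


# Usage with toy tensors: one 3x32x32 image and a random "text embedding".
video = generate_video(DummyDenoiser(), torch.randn(1, 3, 32, 32),
                       torch.randn(1, 77, 768), num_new_frames=4, T=10)
print(video.shape)  # torch.Size([1, 12, 3, 32, 32])
```

Because each new frame depends only on a fixed-length window of previous frames, the same loop can keep running indefinitely, which is what makes training-free long video generation possible in this style of approach.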