25 Apr 2024 | Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks
The paper introduces TI2V-Zero, a zero-shot, tuning-free method for text-conditioned image-to-video generation (TI2V). TI2V-Zero leverages a pretrained text-to-video (T2V) diffusion model to condition on a given image and generate a video without any optimization or fine-tuning. The method uses a "repeat-and-slide" strategy to modulate the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame by frame starting from the provided image. To ensure temporal continuity, the approach employs a DDPM inversion strategy to initialize the Gaussian noise for each new frame and a resampling technique to preserve visual details. Extensive experiments on domain-specific and open-domain datasets show that TI2V-Zero outperforms a recent open-domain TI2V model and extends to tasks such as video infilling and prediction. The autoregressive design also supports long video generation.
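To make the repeat-and-slide idea concrete, here is a minimal PyTorch sketch of the autoregressive loop as the summary describes it: the conditioning window is bootstrapped by repeating the input image, the known frames are clamped to their forward-diffused versions at every reverse step, and each new frame's noise is initialized from the previous frame rather than pure Gaussian noise. All names (`eps_theta`, shapes, schedules) are hypothetical placeholders, not the authors' implementation, and the paper's resampling technique is omitted for brevity.

```python
import torch

# ---- Hypothetical setup (not the paper's actual code or hyperparameters) ----
K = 8              # frames the frozen T2V model denoises jointly
T = 50             # number of DDPM steps
C, H, W = 4, 32, 32

betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def eps_theta(z, t, text_emb):
    """Stand-in for the frozen text-to-video denoiser (predicts noise)."""
    return torch.zeros_like(z)  # placeholder; a real pretrained model goes here

def ddpm_forward(x0, t):
    """DDPM forward process q(x_t | x_0): noise a clean latent to step t."""
    noise = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise

def ddpm_step(z, eps, t):
    """One reverse (ancestral) DDPM sampling step from t to t-1."""
    mean = (z - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
    if t > 0:
        mean = mean + betas[t].sqrt() * torch.randn_like(z)
    return mean

def generate_next_frame(cond_frames, text_emb):
    """Synthesize one new frame while clamping the first K-1 latents to the
    noised conditioning frames at every reverse step."""
    # Continuity-friendly init: start the new frame from a noised copy of the
    # last known frame (a DDPM-inversion-style initialization) rather than
    # from pure Gaussian noise.
    z = torch.stack(list(cond_frames) + [ddpm_forward(cond_frames[-1], T - 1)])
    for t in reversed(range(T)):
        for i in range(K - 1):
            # Clamp known frames to their forward-diffused versions at step t.
            z[i] = ddpm_forward(cond_frames[i], t)
        eps = eps_theta(z.unsqueeze(0), t, text_emb).squeeze(0)
        z = ddpm_step(z, eps, t)
    return z[-1]  # the newly generated K-th frame

# "Repeat": bootstrap the conditioning queue with K-1 copies of the input
# image; then "slide" the window one frame at a time to grow the video.
image_latent = torch.randn(C, H, W)   # encoded input image (placeholder)
text_emb = torch.zeros(1, 77, 768)    # text embedding (placeholder)
frames = [image_latent.clone() for _ in range(K - 1)]
video = [image_latent]
for _ in range(4):                    # autoregressively generate 4 frames
    new_frame = generate_next_frame(frames, text_emb)
    video.append(new_frame)
    frames = frames[1:] + [new_frame]  # slide the conditioning window
```

Because each iteration only ever appends one frame and shifts the window, the same loop can run indefinitely, which is what lets this autoregressive design produce videos longer than the model's native K-frame horizon.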