ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

2024-06-16 | Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao
ViD-GPT introduces GPT-style autoregressive generation into video diffusion models (VDMs), enabling long video generation with improved quality and efficiency. Its key innovations are causal generation, in which each frame depends only on its preceding frames, and frame-as-prompt, which feeds clean, previously generated frames back into the model as prompts to guide denoising. A KV-cache mechanism, borrowed from large language models, eliminates redundant computation over already-processed frames and boosts inference speed. Trained on large-scale video-text datasets, ViD-GPT achieves state-of-the-art performance on both quantitative and qualitative benchmarks. Experiments show that it outperforms existing methods in long video generation, with better long-term temporal consistency and fewer content mutations, while running significantly faster at inference than the baselines. Its main limitations are the computational resources it requires and its current restriction to image-conditioned text-to-video generation.
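The combination of causal generation and a KV-cache can be sketched as below. This is a toy, single-head temporal attention over per-frame feature vectors, not the paper's implementation: the class and method names are illustrative, and the cache simply stores one key/value entry per already-generated frame so each new step only projects the newest frame.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CausalFrameAttention:
    """Toy causal temporal attention with a KV-cache.

    Each new frame attends only to itself and previously seen frames;
    cached keys/values avoid re-projecting past frames at every step.
    (Illustrative sketch, not ViD-GPT's actual architecture.)
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random projection matrices stand in for learned weights.
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.k_cache = []  # one cached key per past frame
        self.v_cache = []  # one cached value per past frame

    def step(self, frame_feat):
        """Process one new frame feature vector, reusing cached K/V."""
        q = frame_feat @ self.wq
        # Only the newest frame is projected; older K/V come from the cache.
        self.k_cache.append(frame_feat @ self.wk)
        self.v_cache.append(frame_feat @ self.wv)
        k = np.stack(self.k_cache)  # (t, dim)
        v = np.stack(self.v_cache)  # (t, dim)
        # Causality holds by construction: the cache contains no future frames.
        attn = softmax(q @ k.T / np.sqrt(q.shape[0]))
        return attn @ v

dim = 8
attn = CausalFrameAttention(dim)
rng = np.random.default_rng(42)
outs = [attn.step(rng.standard_normal(dim)) for _ in range(4)]
# After 4 steps the cache holds K/V for all 4 frames.
```

Because each step appends exactly one key/value pair, generating frame *t* costs one projection plus an attention over *t* cached entries, instead of re-projecting all previous frames, which is the redundancy the KV-cache removes.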