16 Jun 2024
**ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models**
**Authors:** Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao
**Affiliations:** Zhejiang University, Huawei Cloud Computing, Nanyang Technological University, Finvolution Group
**Contact Information:** kite_phone@zju.edu.cn, shijiaxin3@huawei.com, hanwangzhang@ntu.edu.sg, wangchunping02@xinye.com, junx@cs.zju.edu.cn
**Abstract:**
ViD-GPT introduces GPT-style autoregressive generation into video diffusion models (VDMs) to address the challenge of generating long, temporally consistent videos. Traditional VDMs rely on bidirectional computations, which limit their temporal receptive context and weaken long-term dependencies. ViD-GPT instead generates causally (unidirectionally), using past frames as prompts for generating future frames. It introduces causal temporal attention, so that each generated frame depends only on its preceding frames, and a frame-as-prompt mechanism that conditions the denoising process on unnoised prefix frames. Additionally, ViD-GPT incorporates a kv-cache mechanism to eliminate redundant computation, significantly improving inference speed. Extensive experiments demonstrate that ViD-GPT achieves state-of-the-art performance, both quantitatively and qualitatively, on long video generation.
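The overall generation pattern can be pictured with a short, heavily simplified Python sketch. This is not the authors' code: `denoise_chunk` is a hypothetical placeholder for a full diffusion sampling routine, and frames are treated as latents of shape `(t, c, h, w)`. The sketch only illustrates the GPT-style loop of conditioning each new chunk on all previously generated frames.

```python
import torch

@torch.no_grad()
def generate_long_video(denoise_chunk, first_frame_latent, text_emb,
                        chunk_size=16, num_chunks=8):
    # `denoise_chunk(prompt, text_emb, n)` is a hypothetical stand-in for a full
    # diffusion sampling loop that returns `n` new latent frames conditioned on
    # the prompt frames and the text embedding.
    frames = [first_frame_latent]              # (1, c, h, w) initial prompt frame
    for _ in range(num_chunks):
        prompt = torch.cat(frames, dim=0)      # all past frames serve as the prompt
        new = denoise_chunk(prompt, text_emb, chunk_size)
        frames.append(new)                     # append only, never revisit: causal
    return torch.cat(frames, dim=0)            # (1 + num_chunks * chunk_size, c, h, w)
```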
**Contributions:**
1. Introduces GPT-style autoregressive generation into VDMs.
2. Introduces the kv-cache mechanism to boost inference speed.
3. Achieves state-of-the-art performance in long video generation.
**Related Work:**
- Video Diffusion Models: Prior work trains VDMs by equipping them with temporal attention or temporal convolution layers.
- Long Video Generation: Existing approaches either extend short-video VDMs in a training-free manner or generate longer videos autoregressively with various design choices.
**Method:**
- **Causal Video Diffusion Models:** Uses spatial-temporal Transformers with causal temporal attention, so each frame attends only to itself and earlier frames (see the causal-attention sketch after this list).
- **Training with Frame as Prompt:** Enhances the denoising process by keeping prefix frames unnoised as prompts during training (see the frame-as-prompt sketch after this list).
- **Inference Boosted with KV-cache:** Eliminates redundant computation over previously generated frames using a kv-cache mechanism (see the kv-cache sketch after this list).
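A minimal PyTorch sketch of causal temporal attention, not the paper's implementation: it assumes latents have been reshaped so that each spatial location forms a sequence over frames, and uses `F.scaled_dot_product_attention` with its built-in lower-triangular mask so frame i only attends to frames 0..i.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTemporalAttention(nn.Module):
    """Temporal self-attention where frame i attends only to frames 0..i."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial_tokens, num_frames, dim) -- one temporal sequence
        # per spatial location, as is typical for spatial-temporal Transformers.
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):
            return z.view(b, t, self.num_heads, d // self.num_heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # is_causal=True applies a lower-triangular mask: the GPT-style
        # unidirectional constraint over the frame axis.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))
```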
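A minimal sketch of the frame-as-prompt training idea, with assumed tensor shapes and a hypothetical helper name (`add_noise_with_frame_prompt`): the first prompt frames are left clean while the remaining frames receive diffusion noise. The returned loss mask (excluding prompt frames from the denoising loss) is an assumption for illustration, not a detail stated in the summary above.

```python
import torch

def add_noise_with_frame_prompt(latents, noise, alphas_cumprod, timesteps,
                                num_prompt_frames):
    # latents, noise: (batch, num_frames, channels, height, width)
    # alphas_cumprod: (num_train_timesteps,) cumulative product of alphas
    # timesteps: (batch,) diffusion steps sampled for the noised frames
    a = alphas_cumprod[timesteps].view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise

    # Prefix frames act as the prompt and are kept unnoised.
    noisy[:, :num_prompt_frames] = latents[:, :num_prompt_frames]

    # Assumed detail: no denoising loss on the clean prompt frames.
    loss_mask = torch.ones_like(latents)
    loss_mask[:, :num_prompt_frames] = 0.0
    return noisy, loss_mask
```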
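A minimal sketch of kv-caching for causal temporal attention, again with hypothetical names: keys and values of already-generated frames are stored once a chunk is finished and reused at every later denoising step, so each autoregressive step only projects queries, keys, and values for its new frames. New frames attend to all cached past frames and causally among themselves.

```python
import torch
import torch.nn.functional as F

class TemporalKVCache:
    """Stores temporal-attention keys/values of frames already generated."""
    def __init__(self):
        self.k = None  # (batch, heads, cached_frames, head_dim)
        self.v = None

    def append(self, k_done, v_done):
        # Called once per autoregressive step, after a chunk is fully denoised.
        self.k = k_done if self.k is None else torch.cat([self.k, k_done], dim=2)
        self.v = v_done if self.v is None else torch.cat([self.v, v_done], dim=2)

def cached_causal_attention(q_new, k_new, v_new, cache):
    # q_new/k_new/v_new: projections of the frames being denoised, (b, h, t_new, d).
    # Cached k/v of past frames are reused instead of recomputed at every
    # denoising step, which is where the inference speedup comes from.
    if cache.k is not None:
        k = torch.cat([cache.k, k_new], dim=2)
        v = torch.cat([cache.v, v_new], dim=2)
    else:
        k, v = k_new, v_new
    t_new, t_total = q_new.shape[2], k.shape[2]
    t_past = t_total - t_new
    # New frames see all past frames, and only earlier/current new frames.
    mask = torch.ones(t_new, t_total, dtype=torch.bool, device=q_new.device)
    mask[:, t_past:] = torch.tril(
        torch.ones(t_new, t_new, dtype=torch.bool, device=q_new.device))
    return F.scaled_dot_product_attention(q_new, k, v, attn_mask=mask)
```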
**Experiments:**
- **Implementation Details:** Uses a spatial-temporal Transformer backbone, T5 as the text encoder, and a pretrained VAE.
- **Comparisons for Short Video Generation:** Evaluates on the MSR-VTT and UCF-101 datasets.
- **Comparisons for Long Video Generation:** Compares with Gen-L-Video, StreamT2V, and OpenSORA.
- **Qualitative Results:** Shows smoother transitions and better long-term consistency.
- **Inference Speed:** Significantly improves inference speed.
**Conclusion and Limitations:**
- ViD-GPT is a powerful paradigm for long video generation, but it has limitations such as low resolution and a design specific to image-conditioned text-to-video generation.