AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction


10 Jun 2024 | Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang
This paper introduces AID, a method for text-guided video prediction (TVP) that adapts Image2Video diffusion models to incorporate instruction control for generating controllable videos. The main challenges are designing a textual condition injection mechanism and adapting the model to the target dataset at minimal training cost. To address them, the authors introduce a Multi-Modal Large Language Model (MLLM) that predicts future video states from the initial frames and a text instruction. They design a dual query transformer (DQFormer) architecture that integrates the instruction and frames into conditional embeddings for future frame prediction. Additionally, they develop Long-Short Term Temporal Adapters and Spatial Adapters that transfer general video diffusion models to specific scenarios quickly and at low training cost. Experimental results show that AID significantly outperforms state-of-the-art techniques on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. Notably, AID achieves 91.2% and 55.5% FVD improvements on Bridge and SSv2 respectively, demonstrating its effectiveness across domains.
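As a rough illustration of how a dual-query conditioning module could be wired, the sketch below builds two learnable query banks, one cross-attending to first-frame tokens and one to instruction tokens, and concatenates the results into a single conditional embedding for the diffusion backbone. All module names, query counts, and dimensions are assumptions for exposition; this is not the authors' released implementation of DQFormer.

```python
# Illustrative dual-query transformer sketch (DQFormer-style conditioning).
# Shapes and hyperparameters are hypothetical, chosen only to show the idea.
import torch
import torch.nn as nn

class DualQueryFormer(nn.Module):
    def __init__(self, dim=768, n_frame_queries=16, n_text_queries=16,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Two learnable query banks: one reads the visual context, one the instruction.
        self.frame_queries = nn.Parameter(torch.randn(1, n_frame_queries, dim) * 0.02)
        self.text_queries = nn.Parameter(torch.randn(1, n_text_queries, dim) * 0.02)
        self.frame_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, frame_tokens, text_tokens):
        # frame_tokens: (B, N_f, dim) tokens from the conditioning frame(s)
        # text_tokens:  (B, N_t, dim) instruction tokens (e.g. from an MLLM / text encoder)
        b = frame_tokens.size(0)
        q_frame = self.frame_queries.expand(b, -1, -1)
        q_text = self.text_queries.expand(b, -1, -1)
        # Each query bank cross-attends to its own modality.
        frame_cond = self.frame_decoder(q_frame, frame_tokens)
        text_cond = self.text_decoder(q_text, text_tokens)
        # Fuse into one conditional embedding for the video diffusion UNet.
        return self.out_proj(torch.cat([frame_cond, text_cond], dim=1))
```

Keeping separate query banks per modality is one plausible way to let the visual and textual conditions contribute tokens independently before they are fused, which matches the paper's stated goal of aligning multi-modal conditions with instructions.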
The method generates videos that align with the text instruction while maintaining temporal consistency. The paper details three types of adapters: spatial, long-term temporal, and short-term temporal; together they enable model transfer with few additional parameters and little extra computation. AID achieves over 50% improvement in Fréchet Video Distance (FVD) across multiple datasets compared to the previous state of the art, and ablation studies validate the effectiveness of each component in the framework. The main contributions are the first transfer of an image-guided video generation model to multi-modal guided video prediction, the proposed DQFormer that aligns multi-modal conditions with instructions, and the temporal and spatial adapters for efficient training. Overall, the results demonstrate the effectiveness of transferring general generation models to specific video prediction tasks.
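The adapter idea can be made concrete with a minimal PyTorch sketch: a standard zero-initialized bottleneck adapter plus a temporal variant whose attention window distinguishes a short-term from a long-term version. The backbone is assumed frozen so that only these small modules are trained; the placement, bottleneck ratio, and window size below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of lightweight adapters for adapting a frozen video diffusion backbone.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual down-project -> nonlinearity -> up-project adapter."""
    def __init__(self, dim, ratio=4):
        super().__init__()
        hidden = dim // ratio
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        # Zero-init the up-projection so the adapted model starts identical to the frozen one.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class TemporalAdapter(nn.Module):
    """Mixes features across frames; a small window approximates the short-term variant,
    window=None attends over all frames (long-term variant)."""
    def __init__(self, dim, window=None, n_heads=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        # x: (B, T, dim) per-position features stacked over T frames
        mask = None
        if self.window is not None:
            t = x.size(1)
            idx = torch.arange(t)
            # Boolean mask: True blocks attention beyond the local temporal window.
            mask = ((idx[None, :] - idx[:, None]).abs() > self.window).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.adapter(x + out)
```

Because the up-projections start at zero, the pretrained backbone's behavior is unchanged at the beginning of fine-tuning, which is the usual reason adapter-style transfer is cheap and stable compared to full fine-tuning.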