AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction


10 Jun 2024 | Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang
This paper introduces AID, a method for text-guided video prediction (TVP) that adapts Image2Video diffusion models to incorporate instruction control for generating controllable videos. The main challenges are designing a textual condition injection mechanism and adapting the model to the target dataset at minimal training cost. To address them, the authors introduce a Multi-Modal Large Language Model (MLLM) that predicts future video states from the initial frames and a text instruction. They design a dual query transformer (DQFormer) architecture that integrates the instruction and frames into conditional embeddings for future frame prediction. Additionally, they develop Long-Short Term Temporal Adapters and Spatial Adapters that transfer general video diffusion models to specific scenarios quickly and at low training cost. Experimental results show that AID significantly outperforms state-of-the-art techniques on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. Notably, AID achieves 91.2% and 55.5% FVD improvements on Bridge and SSv2 respectively, demonstrating its effectiveness across domains.
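As a rough illustration of how a dual-query conditioning module could be wired, the sketch below builds two learnable query banks, one cross-attending to first-frame tokens and one to instruction tokens, and concatenates the results into a single conditional embedding for the diffusion backbone. All module names, query counts, and dimensions are assumptions for exposition; this is not the authors' released implementation of DQFormer.

```python
# Illustrative dual-query transformer sketch (DQFormer-style conditioning).
# Shapes and hyperparameters are hypothetical, chosen only to show the idea.
import torch
import torch.nn as nn

class DualQueryFormer(nn.Module):
    def __init__(self, dim=768, n_frame_queries=16, n_text_queries=16,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Two learnable query banks: one reads the visual context, one the instruction.
        self.frame_queries = nn.Parameter(torch.randn(1, n_frame_queries, dim) * 0.02)
        self.text_queries = nn.Parameter(torch.randn(1, n_text_queries, dim) * 0.02)
        self.frame_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, frame_tokens, text_tokens):
        # frame_tokens: (B, N_f, dim) tokens from the conditioning frame(s)
        # text_tokens:  (B, N_t, dim) instruction tokens (e.g. from an MLLM / text encoder)
        b = frame_tokens.size(0)
        q_frame = self.frame_queries.expand(b, -1, -1)
        q_text = self.text_queries.expand(b, -1, -1)
        # Each query bank cross-attends to its own modality.
        frame_cond = self.frame_decoder(q_frame, frame_tokens)
        text_cond = self.text_decoder(q_text, text_tokens)
        # Fuse into one conditional embedding for the video diffusion UNet.
        return self.out_proj(torch.cat([frame_cond, text_cond], dim=1))
```

Keeping separate query banks per modality is one plausible way to let the visual and textual conditions contribute tokens independently before they are fused, which matches the paper's stated goal of aligning multi-modal conditions with instructions.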
The method generates videos that align with the text instruction while maintaining temporal consistency. The paper details three types of adapters: spatial, long-term temporal, and short-term temporal; together they enable model transfer with few additional parameters and little extra computation. AID achieves over 50% improvement in Fréchet Video Distance (FVD) across multiple datasets compared to the previous state of the art, and ablation studies validate the effectiveness of each component in the framework. The main contributions are the first transfer of an image-guided video generation model to multi-modal guided video prediction, the proposed DQFormer that aligns multi-modal conditions with instructions, and the temporal and spatial adapters for efficient training. Overall, the results demonstrate the effectiveness of transferring general generation models to specific video prediction tasks.
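The adapter idea can be made concrete with a minimal PyTorch sketch: a standard zero-initialized bottleneck adapter plus a temporal variant whose attention window distinguishes a short-term from a long-term version. The backbone is assumed frozen so that only these small modules are trained; the placement, bottleneck ratio, and window size below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of lightweight adapters for adapting a frozen video diffusion backbone.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual down-project -> nonlinearity -> up-project adapter."""
    def __init__(self, dim, ratio=4):
        super().__init__()
        hidden = dim // ratio
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        # Zero-init the up-projection so the adapted model starts identical to the frozen one.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class TemporalAdapter(nn.Module):
    """Mixes features across frames; a small window approximates the short-term variant,
    window=None attends over all frames (long-term variant)."""
    def __init__(self, dim, window=None, n_heads=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        # x: (B, T, dim) per-position features stacked over T frames
        mask = None
        if self.window is not None:
            t = x.size(1)
            idx = torch.arange(t)
            # Boolean mask: True blocks attention beyond the local temporal window.
            mask = ((idx[None, :] - idx[:, None]).abs() > self.window).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.adapter(x + out)
```

Because the up-projections start at zero, the pretrained backbone's behavior is unchanged at the beginning of fine-tuning, which is the usual reason adapter-style transfer is cheap and stable compared to full fine-tuning.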