VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models


2024-01-17 | Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan
VideoCrafter2 is a method for training high-quality video diffusion models without relying on high-quality videos. The paper extends Stable Diffusion to video and investigates whether a high-quality video model can be obtained from low-quality videos combined with high-quality images.

The authors analyze the connection between the spatial and temporal modules of video models and the distribution shift toward low-quality videos. They observe that fully training all modules produces a stronger coupling between spatial and temporal modules than training only the temporal modules. Exploiting this coupling, they shift the distribution toward higher quality without degrading motion by fine-tuning the spatial modules on high-quality images, yielding a generic high-quality video model.

The method therefore has two stages: first, fully train a video model on low-quality videos; then fine-tune its spatial modules on high-quality images. The authors also propose fine-tuning on synthesized images containing complex concepts to improve concept composition. Evaluated on a benchmark against state-of-the-art text-to-video generation models, the method achieves comparable or better visual quality, motion quality, and text-video alignment, indicating that high-quality videos can be generated without requiring high-quality videos for training.
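The two-stage recipe can be sketched in code. Below is a minimal, hedged PyTorch illustration of the idea: train all modules on low-quality videos, then freeze the temporal modules and fine-tune only the spatial ones on high-quality images. The toy model, the name-based spatial/temporal split, and the placeholder loss and data are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a video UNet; "spatial_*" / "temporal_*" naming is an
# assumption used only to show the parameter split, not VideoCrafter2's code.
class ToyVideoUNet(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.spatial_block = nn.Linear(dim, dim)    # stands in for spatial (2D) layers
        self.temporal_block = nn.Linear(dim, dim)   # stands in for temporal layers

    def forward(self, x):
        return self.temporal_block(self.spatial_block(x))

def set_trainable(model, train_spatial, train_temporal):
    """Toggle gradients for spatial vs. temporal parameters by name."""
    for name, p in model.named_parameters():
        p.requires_grad = train_temporal if "temporal" in name else train_spatial

def train(model, data, steps=10, lr=1e-4):
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        loss = model(data).pow(2).mean()   # placeholder for the diffusion denoising loss
        loss.backward()
        opt.step()
        opt.zero_grad()

model = ToyVideoUNet()

# Stage 1: full training on low-quality videos (all modules trainable),
# which couples the spatial and temporal layers.
low_quality_videos = torch.randn(4, 8)          # placeholder batch
set_trainable(model, train_spatial=True, train_temporal=True)
train(model, low_quality_videos)

# Stage 2: fine-tune only the spatial modules on high-quality (or synthesized)
# images, with temporal modules frozen, shifting appearance quality without
# degrading motion.
high_quality_images = torch.randn(4, 8)         # placeholder batch
set_trainable(model, train_spatial=True, train_temporal=False)
train(model, high_quality_images)
```

In a real setting the toy module would be replaced by the Stable-Diffusion-based video UNet and the placeholder loss by the standard noise-prediction objective; the key point the sketch captures is which parameters receive gradients in each stage.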