The paper "VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models" by Haoxin Chen et al. addresses the challenge of training high-quality video diffusion models without access to high-quality videos. The authors explore the training schemes of video models based on Stable Diffusion (SD) and investigate the connection between spatial and temporal modules. They observe that fully training all modules results in stronger coupling between spatial and temporal modules, which allows for better motion consistency and picture quality when fine-tuned with high-quality images. The proposed method disentangles motion from appearance at the data level, using low-quality videos for motion learning and high-quality images for appearance learning. Evaluations demonstrate the effectiveness of the method in generating high-quality videos with minimal noise, excellent details, and high aesthetic scores, outperforming existing models in visual quality, motion, and concept composition. The paper also includes a user study to validate the method's performance in terms of visual quality and motion quality.The paper "VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models" by Haoxin Chen et al. addresses the challenge of training high-quality video diffusion models without access to high-quality videos. The authors explore the training schemes of video models based on Stable Diffusion (SD) and investigate the connection between spatial and temporal modules. They observe that fully training all modules results in stronger coupling between spatial and temporal modules, which allows for better motion consistency and picture quality when fine-tuned with high-quality images. The proposed method disentangles motion from appearance at the data level, using low-quality videos for motion learning and high-quality images for appearance learning. Evaluations demonstrate the effectiveness of the method in generating high-quality videos with minimal noise, excellent details, and high aesthetic scores, outperforming existing models in visual quality, motion, and concept composition. The paper also includes a user study to validate the method's performance in terms of visual quality and motion quality.