ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation


07/2024 | Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, Wenhu Chen
This paper proposes ConsistI2V, a diffusion-based method for enhancing visual consistency in image-to-video (I2V) generation. The central challenge in I2V generation is maintaining visual consistency throughout the video: existing methods often fail to preserve the integrity of the subject, background, and style of the first frame, or to ensure a fluid and logical progression of the video narrative. To address these issues, ConsistI2V introduces (1) spatiotemporal attention over the first frame to maintain spatial and motion consistency, and (2) noise initialization from the low-frequency band of the first frame to enhance layout consistency. Together, these techniques enable ConsistI2V to generate highly consistent videos.

The authors further show that these techniques extend to autoregressive long video generation (by conditioning each new segment on the last frame of the previous one) and to camera motion control. To verify the effectiveness of the method, they propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation; both automatic and human evaluations demonstrate the superiority of ConsistI2V over existing methods.

The paper presents a detailed methodology for ConsistI2V, covering the model architecture, fine-grained spatial feature conditioning, window-based temporal feature conditioning, and inference-time layout-guided noise initialization.
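Of these components, the layout-guided noise initialization is the most self-contained, so here is a minimal PyTorch sketch of the idea: diffuse a static video built from the first-frame latent, then keep its low-frequency band and fill in the high frequencies with fresh Gaussian noise. The function name, the simple spherical (ideal) low-pass mask, the `cutoff` value, and the `alphas_cumprod` schedule argument are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.fft as fft

def layout_guided_noise_init(first_frame_latent, num_frames, noise_level=999,
                             cutoff=0.25, alphas_cumprod=None):
    """Sketch of inference-time layout-guided noise initialization.

    first_frame_latent: (C, H, W) VAE latent of the conditioning image.
    alphas_cumprod: assumed 1-D tensor of the diffusion schedule (optional here).
    Returns an initial latent of shape (C, num_frames, H, W) whose
    low-frequency band follows the first frame's layout.
    """
    C, H, W = first_frame_latent.shape
    # Repeat the first-frame latent across time to form a static video,
    # then diffuse it to the chosen noise level.
    static = first_frame_latent.unsqueeze(1).expand(C, num_frames, H, W)
    if alphas_cumprod is not None:
        a = alphas_cumprod[noise_level]
        static = a.sqrt() * static + (1 - a).sqrt() * torch.randn_like(static)

    noise = torch.randn_like(static)

    # Move both tensors to the frequency domain (3D FFT over T, H, W).
    static_freq = fft.fftshift(fft.fftn(static, dim=(-3, -2, -1)), dim=(-3, -2, -1))
    noise_freq = fft.fftshift(fft.fftn(noise, dim=(-3, -2, -1)), dim=(-3, -2, -1))

    # Build a low-pass mask: keep frequencies within `cutoff` of the center.
    t = torch.linspace(-1, 1, num_frames)
    h = torch.linspace(-1, 1, H)
    w = torch.linspace(-1, 1, W)
    tt, hh, ww = torch.meshgrid(t, h, w, indexing="ij")
    mask = ((tt ** 2 + hh ** 2 + ww ** 2).sqrt() <= cutoff).to(static_freq.dtype)

    # Low-frequency band from the first frame, high-frequency band from noise.
    mixed = static_freq * mask + noise_freq * (1 - mask)
    mixed = fft.ifftshift(mixed, dim=(-3, -2, -1))
    return fft.ifftn(mixed, dim=(-3, -2, -1)).real
```

At inference time, a latent like this replaces pure Gaussian noise, so the sampled video inherits the coarse spatial layout of the conditioning image while the high-frequency content remains free to move.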
On standard benchmarks, including UCF-101 and MSR-VTT, ConsistI2V significantly outperforms other I2V generation models in visual quality, consistency, and video-text alignment. The authors also note the current method's limitations: a low-resolution training dataset, limited motion magnitude, and the need to tune the spatial U-Net layers during training. They conclude that ConsistI2V is a promising approach to I2V generation with clear room for further improvement.
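To give a concrete picture of the window-based temporal feature conditioning summarized above, the following single-head sketch augments each spatial location's temporal attention with a local window of first-frame features around that location. The helper name, the replicate padding, the square window, and the given `to_q`/`to_k`/`to_v` projections are assumptions for illustration; the actual model uses multi-head attention inside its U-Net temporal layers.

```python
import torch
import torch.nn.functional as F

def windowed_first_frame_temporal_attn(x, to_q, to_k, to_v, window=3):
    """Sketch of window-based temporal attention with first-frame conditioning.

    x: (B, C, T, H, W) feature map inside a temporal attention layer.
    to_q / to_k / to_v: the layer's linear projections (assumed dim-preserving).
    Each spatial location attends over its own temporal trajectory plus a
    (window x window) neighborhood of first-frame features around it.
    """
    B, C, T, H, W = x.shape
    pad = window // 2

    # Tokens per spatial location: (B*H*W, T, C)
    tokens = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, T, C)

    # Gather a local window of first-frame features for every location.
    first = F.pad(x[:, :, 0], (pad, pad, pad, pad), mode="replicate")
    patches = F.unfold(first, kernel_size=window)          # (B, C*window*window, H*W)
    patches = patches.view(B, C, window * window, H * W)
    patches = patches.permute(0, 3, 2, 1).reshape(B * H * W, window * window, C)

    q = to_q(tokens)                                       # queries: all frames
    kv_in = torch.cat([tokens, patches], dim=1)            # keys/values: frames + window
    k, v = to_k(kv_in), to_v(kv_in)

    attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ v                                         # (B*H*W, T, C)
    return out.reshape(B, H, W, T, C).permute(0, 4, 3, 1, 2)
```

Because the first-frame window participates in every temporal attention layer, later frames keep re-reading local appearance cues from the conditioning image, which is what preserves subject and background identity over time.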