CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

18 Mar 2024 | Bojia Zi¹, Shihao Zhao³, Xianbiao Qi*², Jianan Wang², Yukai Shi⁴, Qianyu Chen¹, Bin Liang¹, Kam-Fai Wong¹, and Lei Zhang²
CoCoCo is a text-guided video inpainting model designed to improve motion consistency, textual controllability, and model compatibility. The paper addresses the limitations of existing methods, such as poor text-video alignment and low temporal consistency, by introducing three key innovations:

1. **Motion Capture Module**: two temporal attention layers, a damped global attention layer, and a textual cross-attention layer that together enhance motion consistency and text-video alignment (a sketch of the damped attention follows this list).
2. **Instance-Aware Region Selection**: instead of random mask selection, this strategy uses Grounding DINO to detect the instance region in the first frame and align it with the rest of the frames, ensuring better text-video controllability (see the mask-construction sketch below).
3. **Model Compatibility**: a novel strategy transforms personalized text-to-image (T2I) models to be compatible with the video inpainting model, allowing customized content to be generated in the masked regions (see the weight-transfer sketch below).

Extensive experiments demonstrate superior motion consistency, textual controllability, and model compatibility compared with existing methods. The paper also includes a detailed ablation study validating the effectiveness of each component, along with qualitative and quantitative results supporting its claims.
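To make the damped global attention concrete, here is a minimal PyTorch sketch. It assumes the damping is a single learnable scalar applied to a residual global self-attention over all frame tokens; the paper's exact parameterization is not given in this summary, so the names and initialization are illustrative only.

```python
import torch
import torch.nn as nn

class DampedGlobalAttention(nn.Module):
    """Sketch of a damped global (spatio-temporal) attention layer.

    Assumption: "damped" means the attention output is scaled by a small
    learnable coefficient, so the new layer starts close to identity and
    cannot overwhelm the pretrained spatial layers early in training.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Learnable damping factor, initialized small (illustrative value).
        self.damp = nn.Parameter(torch.tensor(0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames * height * width, dim) — tokens from all frames,
        # so attention mixes information across both space and time.
        h = self.norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + self.damp * h  # residual connection with damped output

# Example: 2 clips, 8 frames of 16x16 latent tokens, 320 channels.
tokens = torch.randn(2, 8 * 16 * 16, 320)
out = DampedGlobalAttention(320)(tokens)  # same shape as input
```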
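The instance-aware region selection can be sketched as follows. Here `detect` stands in for a hypothetical wrapper around Grounding DINO that returns bounding boxes for a text prompt; the cross-frame alignment is simplified to reusing the first-frame box, with a random box as fallback (the baseline strategy the paper replaces).

```python
import numpy as np

def random_box_mask(hw, rng=np.random):
    """Fallback: a random rectangular mask (the baseline being improved on)."""
    h, w = hw
    y0, x0 = rng.randint(0, h // 2), rng.randint(0, w // 2)
    m = np.zeros((h, w), dtype=np.uint8)
    m[y0:y0 + h // 2, x0:x0 + w // 2] = 1
    return m

def instance_aware_masks(frames, caption, detect):
    """Build per-frame training masks from a detected instance region.

    `detect(image, text)` is a hypothetical Grounding DINO wrapper returning
    a list of (x0, y0, x1, y1) boxes. Per the summary, detection runs on the
    first frame and the region is aligned with the remaining frames; this
    sketch simply reuses the first-frame box for every frame.
    """
    boxes = detect(frames[0], caption)
    if not boxes:
        return [random_box_mask(f.shape[:2]) for f in frames]
    x0, y0, x1, y1 = (int(v) for v in boxes[0])
    masks = []
    for f in frames:
        m = np.zeros(f.shape[:2], dtype=np.uint8)
        m[y0:y1, x0:x1] = 1  # mark the detected instance region
        masks.append(m)
    return masks
```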
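For model compatibility, one plausible reading of "transforming a personalized T2I model" is task-vector-style weight arithmetic: add the personalized-minus-base weight delta of a T2I model onto the matching parameters of the video inpainting model. The paper's actual transformation may differ; this sketch only illustrates the general idea on PyTorch state dicts.

```python
import torch

def transfer_personalization(inpaint_sd, base_t2i_sd, personalized_t2i_sd):
    """Hedged sketch: merge a personalized T2I model into an inpainting model.

    For every parameter shared by all three state dicts (same name and
    shape), add the personalization delta (personalized - base) onto the
    inpainting weights; inpainting-specific parameters pass through
    unchanged. This is an assumed rule, not the paper's verified method.
    """
    merged = {}
    for name, w in inpaint_sd.items():
        if (name in base_t2i_sd and name in personalized_t2i_sd
                and base_t2i_sd[name].shape == w.shape):
            delta = personalized_t2i_sd[name] - base_t2i_sd[name]
            merged[name] = w + delta
        else:
            merged[name] = w  # keep inpainting-only weights as-is
    return merged
```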