CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

18 Mar 2024 | Bojia Zi¹, Shihao Zhao³, Xianbiao Qi², Jianan Wang², Yukai Shi⁴, Qianyu Chen¹, Bin Liang¹, Kam-Fai Wong¹, and Lei Zhang²
CoCoCo is a text-guided video inpainting model designed to improve consistency, controllability, and compatibility. It introduces a motion capture module with damped global attention and textual cross-attention to strengthen motion consistency and text-video alignment, employs an instance-aware region selection strategy to improve textual controllability, and integrates personalized text-to-image (T2I) models through a task vector combination strategy to improve compatibility. Trained on a large-scale dataset, CoCoCo produces high-quality inpainted video with better motion consistency, textual controllability, and compatibility than existing methods. Experiments show that it outperforms competing approaches in text alignment, background preservation, and temporal consistency, and that it remains compatible with a variety of personalized T2I models while generating video with better visual quality and motion smoothness. These results demonstrate the effectiveness of the proposed components for text-guided video inpainting.
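The summary does not spell out how the task vector combination strategy is implemented, but it is commonly realized as weight-space arithmetic: the difference between a personalized T2I checkpoint and its base model is added to the inpainting model's weights. The sketch below illustrates that idea under these assumptions; the function name, the `scale` parameter, and the use of simple elementwise addition are illustrative choices, not the authors' exact procedure.

```python
import torch

def combine_task_vector(base_sd: dict, personalized_sd: dict,
                        inpainting_sd: dict, scale: float = 1.0) -> dict:
    """Hypothetical task-vector merge (a sketch, not CoCoCo's exact method).

    base_sd:         state dict of the original base T2I model
    personalized_sd: state dict of a personalized T2I model (e.g. fine-tuned)
    inpainting_sd:   state dict of the video inpainting model built on the base
    scale:           strength of the personalization to transfer
    """
    merged = {}
    for name, w_inpaint in inpainting_sd.items():
        if name in base_sd and name in personalized_sd:
            # Task vector: how the personalized weights differ from the base.
            task_vector = personalized_sd[name] - base_sd[name]
            merged[name] = w_inpaint + scale * task_vector
        else:
            # Layers unique to the inpainting model (e.g. motion modules) are kept as-is.
            merged[name] = w_inpaint
    return merged
```

In this reading, compatibility comes from the fact that the inpainting-specific layers are untouched while the shared T2I backbone inherits the personalization, so different community checkpoints can be swapped in without retraining.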