27 May 2024 | Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, Gordon Wetzstein
Collaborative Video Diffusion (CVD) is a framework for generating multiple videos of the same scene under independent camera control while keeping their content consistent. CVD is trained on top of a state-of-the-art camera-control module for video generation and adds a Cross-Video Synchronization Module that applies epipolar attention between corresponding frames of videos rendered from different camera poses, aligning their features and enforcing geometric consistency across views. Training uses a hybrid strategy that combines static scenes from RealEstate10K with dynamic scenes from WebVid10M, and a new collaborative inference algorithm extends the pairwise-trained model to an arbitrary number of videos at test time.
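The sketch below illustrates the core idea behind epipolar attention as described above: a pixel in one video's frame is only allowed to attend to features lying near its epipolar line in the corresponding frame of the other video. This is a minimal illustration, not the authors' released implementation; the names (`epipolar_mask`, `EpipolarCrossAttention`, `band_width`) are placeholders, and it assumes a known fundamental matrix relating the two camera poses. In the actual model this attention would sit inside the video diffusion U-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def epipolar_mask(F_mat, h, w, band_width=2.0):
    """Boolean [h*w, h*w] mask: query pixel i in view A may attend to key
    pixel j in view B only if j lies within `band_width` pixels of the
    epipolar line F_mat @ x_i."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pts = torch.stack(
        [xs.reshape(-1), ys.reshape(-1), torch.ones(h * w, dtype=torch.long)],
        dim=-1,
    ).float()                                      # [N, 3] homogeneous pixel coords
    lines = pts @ F_mat.T                          # row i = epipolar line of pixel i in view B
    num = (lines @ pts.T).abs()                    # [N, N] |l_i . x_j|
    den = lines[:, :2].norm(dim=-1, keepdim=True).clamp(min=1e-6)
    mask = (num / den) <= band_width               # point-to-line distance test
    # Fallback: if a query's epipolar line misses the image, let it attend everywhere.
    empty = ~mask.any(dim=-1, keepdim=True)
    return mask | empty


class EpipolarCrossAttention(nn.Module):
    """Masked cross-attention from video A's frame features to video B's."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat_a, feat_b, mask):
        # feat_a, feat_b: [B, N, C] flattened per-frame features; mask: [N, N] bool
        q = self.to_q(feat_a)
        k, v = self.to_kv(feat_b).chunk(2, dim=-1)
        B, N, C = q.shape
        q, k, v = (t.reshape(B, N, self.heads, C // self.heads).transpose(1, 2)
                   for t in (q, k, v))
        bias = torch.zeros(N, N, dtype=q.dtype, device=q.device)
        bias = bias.masked_fill(~mask, float("-inf"))  # block off-epipolar keys
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
        out = out.transpose(1, 2).reshape(B, N, C)
        return feat_a + self.proj(out)                 # residual: base features preserved
```

Restricting attention to the epipolar band is what ties the geometry of the two views together: content that should project to the same 3D point can exchange features, while unrelated regions cannot.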
In experiments, CVD outperforms all baseline methods at generating multi-view videos with consistent content and synchronized motion while maintaining fidelity to the text prompt. The strong geometric and semantic consistency it enforces makes it suitable for applications such as large-scale 3D scene generation. Although trained only on video pairs, the collaborative inference procedure shares information efficiently across all generated videos, producing a "collaborative diffusion" effect in which every output depicts the same underlying scene and dynamics. To the authors' knowledge, CVD is the first approach to generate multiple camera-controlled videos with consistent content and dynamics, advancing beyond existing multi-view image generation toward multi-view and multi-trajectory video synthesis.
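One plausible way to realize such a collaborative inference step, sketched below, is to run the pairwise-trained model on pairs of videos at every denoising step and average the noise predictions each video receives. This is an illustration of the pairing-and-averaging idea only, not necessarily the paper's exact algorithm; `pair_model` and its call signature are assumptions.

```python
import itertools
import torch


@torch.no_grad()
def collaborative_denoise_step(pair_model, latents, t, cameras, text_emb):
    """One denoising step for K >= 2 videos using a two-video model.
    latents: [K, C, F, H, W] noisy latents for K videos at timestep t."""
    K = latents.shape[0]
    eps_sum = torch.zeros_like(latents)
    counts = torch.zeros(K, device=latents.device)

    # Run the pairwise model on every pair of videos (or a random subset
    # of pairs when K is large) and accumulate its noise predictions.
    for i, j in itertools.combinations(range(K), 2):
        pair = torch.stack([latents[i], latents[j]])        # [2, C, F, H, W]
        cams = torch.stack([cameras[i], cameras[j]])
        eps_i, eps_j = pair_model(pair, t, cams, text_emb)   # per-video noise estimates
        eps_sum[i] += eps_i
        eps_sum[j] += eps_j
        counts[i] += 1
        counts[j] += 1

    # Average the predictions each video received across all of its pairs;
    # the sampler then uses this averaged eps to update all K latents jointly.
    return eps_sum / counts.view(K, 1, 1, 1, 1)
```

Because every video is denoised with predictions conditioned on its partners, information flows across the whole set even though the underlying network only ever sees two videos at a time.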