MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions

MoonShot is a video generation model that conditions on both image and text inputs, enabling controllable video generation and editing. Its core module, the Multimodal Video Block (MVB), combines conventional spatial-temporal layers for video feature representation with a decoupled cross-attention layer that handles the image and text inputs. The MVB design also lets the model integrate pre-trained image ControlNet modules for geometry control without additional training.

Experiments show that MoonShot outperforms existing methods in image animation, video editing, and text-to-video generation, with gains in visual quality, temporal consistency, and text alignment. Results also improve when image and text conditions are used together. The model is designed to be generic and versatile, so it can be repurposed for applications such as personalized video generation, image animation, and video editing. It is trained on a public dataset and achieves strong results in zero-shot customization.

The model is publicly available on GitHub. The work also evaluates ethical considerations, including the potential for harmful content generation, and takes measures to mitigate such risks. Overall, MoonShot is a promising foundation model for video generation and editing.
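The decoupled cross-attention layer mentioned in the summary can be illustrated with a small sketch. It assumes the common formulation in which the same queries attend to text tokens and image tokens in two separate branches whose outputs are summed; the summary does not spell out how the branches are combined, so the additive merge is an assumption. The learned query/key/value projections are omitted, and the function names and the `image_scale` parameter are hypothetical, not taken from the MoonShot code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists (no learned projections)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def decoupled_cross_attention(queries, text_tokens, image_tokens, image_scale=1.0):
    """Attend to text and image tokens in separate branches, then sum the results.

    Assumption: the branches are merged additively, with `image_scale`
    (hypothetical) weighting the image branch.
    """
    text_out = attention(queries, text_tokens, text_tokens)
    image_out = attention(queries, image_tokens, image_tokens)
    return [[t + image_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]


# Toy inputs: two video-feature queries, three text tokens, one image token.
queries = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_tokens = [[0.5, 0.5]]

fused = decoupled_cross_attention(queries, text_tokens, image_tokens)
```

Because the image branch has its own keys and values, setting `image_scale` to 0 in this sketch reduces the layer to ordinary text-only cross-attention, which shows why such a design can leave text conditioning intact while adding image conditioning on top.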

MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions

3 Jan 2024 | David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo