Still-Moving: Customized Video Generation without Customized Video Data


11 Jul 2024 | Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri
Still-Moving is a framework for customizing text-to-video (T2V) models without requiring any customized video data. The method starts from a text-to-image (T2I) model that has been customized (fine-tuned) on still images and injects its weights into a T2V model built on that T2I backbone. To reconcile the customized T2I model's spatial prior with the T2V model's motion prior, the framework trains lightweight Spatial Adapters and Motion Adapters. The Motion Adapters are trained on "frozen" videos, i.e., static clips constructed from images generated by the customized T2I model, which lets the T2V model adapt to the customized data while retaining its motion prior. The Spatial Adapters then adjust the feature distribution so that the generated videos adhere to the customized T2I model's spatial prior.

This combination enables personalized, stylized, and conditional video generation that seamlessly merges the spatial prior of the customized T2I model with the motion prior of the T2V model. The approach is demonstrated on a wide range of applications, including videos of personalized subjects, stylized content, and ControlNet-conditioned generation. The framework is generic, lightweight, and applicable to any T2V model built on top of a T2I model. Extensive experiments show that Still-Moving outperforms existing methods in fidelity and motion quality, generates diverse and realistic videos without customized video data, and remains robust across different T2V inflation approaches.
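To make the adapter idea more concrete, here is a minimal PyTorch-style sketch. It is an illustrative assumption, not the authors' released code: the names (`LoRAAdapter`, `make_frozen_video`), shapes, and the choice of a low-rank residual adapter are placeholders that show (1) how a lightweight adapter can sit beside a frozen T2V layer and (2) how a "frozen" training video can be built by repeating a still image from the customized T2I model.

```python
import torch
import torch.nn as nn

# Sketch only: module names and shapes are assumptions for illustration,
# not the Still-Moving implementation.

class LoRAAdapter(nn.Module):
    """Low-rank residual adapter added next to a frozen base layer."""
    def __init__(self, dim: int, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op so the base model is unchanged
        self.scale = scale

    def forward(self, base_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # The adapter's low-rank correction is added to the frozen layer's output.
        return base_out + self.scale * self.up(self.down(x))


def make_frozen_video(still_image: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Repeat a customized T2I sample along time -> a static "frozen" video.

    still_image: (C, H, W) tensor from the customized T2I model.
    returns:     (num_frames, C, H, W) clip with no motion.
    """
    return still_image.unsqueeze(0).repeat(num_frames, 1, 1, 1)


if __name__ == "__main__":
    dim, frames = 64, 16
    temporal_layer = nn.Linear(dim, dim)      # stand-in for a frozen T2V temporal block
    for p in temporal_layer.parameters():
        p.requires_grad_(False)               # only the adapter would be trained
    motion_adapter = LoRAAdapter(dim)

    tokens = torch.randn(frames, dim)         # toy per-frame features
    out = motion_adapter(temporal_layer(tokens), tokens)

    still = torch.randn(3, 64, 64)            # toy customized T2I sample
    frozen_clip = make_frozen_video(still, frames)
    print(out.shape, frozen_clip.shape)       # torch.Size([16, 64]) torch.Size([16, 3, 64, 64])
```

In this reading, training on such frozen clips updates only the small adapter weights, which is why the approach stays lightweight and leaves the pretrained T2V motion prior untouched.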