Still-Moving: Customized Video Generation without Customized Video Data

11 Jul 2024 | Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri
Still-Moving is a novel framework for customizing text-to-video (T2V) models without requiring any customized video data. It starts from a text-to-image (T2I) model that has been customized by fine-tuning on still images, and integrates that customized model into a T2V model built on top of it. The core challenge is aligning the customized T2I model's spatial prior with the T2V model's motion prior. To address this, the framework introduces two sets of lightweight modules: Motion Adapters, which are trained on "frozen" videos (static clips constructed from images generated by the customized T2I model) so that the T2V model retains its motion prior while adapting to the customized T2I weights, and Spatial Adapters, which adjust the feature distribution so that the T2V model adheres to the customized T2I model's spatial prior.

The framework is evaluated on personalized, stylized, and conditional video generation, demonstrating that it effectively combines the spatial prior of the customized T2I model with the motion prior of the T2V model. The method is generic, lightweight, and applicable to any T2V model built on a T2I model. In extensive experiments and user studies, it outperforms existing approaches in both customization fidelity and motion quality.
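To make the two ingredients above concrete, here is a minimal sketch in NumPy of (a) constructing a "frozen" video by repeating a single customized-T2I frame along a time axis, and (b) a generic low-rank residual adapter of the kind used for lightweight fine-tuning. All names (`make_frozen_video`, `LinearAdapter`) and the exact adapter form are hypothetical illustrations, not the paper's implementation; the zero-initialized second factor is a common convention that makes the adapter an identity map before training, so the pretrained model's behavior is preserved at the start.

```python
import numpy as np

def make_frozen_video(frame: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Repeat one image along a new leading time axis, yielding a static
    ("frozen") video clip of shape (T, H, W, C). Hypothetical helper."""
    return np.repeat(frame[None, ...], num_frames, axis=0)

class LinearAdapter:
    """Hypothetical low-rank residual adapter: out = x + scale * (x @ A @ B).

    B is zero-initialized, so the adapter is an identity map before training;
    only A and B would be updated during adapter training."""

    def __init__(self, dim: int, rank: int = 4, scale: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((dim, rank)) * 0.02  # small random init
        self.B = np.zeros((rank, dim))                    # zero init -> no-op at start
        self.scale = scale

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x + self.scale * (x @ self.A @ self.B)

# Build a frozen clip from a single (hypothetical) customized-T2I frame.
frame = np.zeros((64, 64, 3))
clip = make_frozen_video(frame, num_frames=16)  # shape (16, 64, 64, 3)
```

In this sketch, training the adapter on frozen clips would teach it to reconcile the customized spatial features with a static-motion target, while the frozen T2V weights keep supplying the motion prior at inference time.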