MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

7 Apr 2024 | Shenghai Yuan, Jinfu Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo
MagicTime is a time-lapse video generation model that learns real-world physical knowledge from time-lapse videos to generate metamorphic videos. It addresses a limitation of existing text-to-video (T2V) models, which encode little physical knowledge and therefore produce videos with limited motion and weak variation.

MagicTime introduces three components. A MagicAdapter decouples spatial and temporal training so the model can encode more physical knowledge from metamorphic videos. A Dynamic Frames Extraction strategy (sketched below) adapts training to the characteristics of time-lapse videos, which span a wide variation range and cover dramatic object metamorphosis. A Magic Text-Encoder improves the understanding of prompts that describe metamorphic processes. In addition, a time-lapse video-text dataset, ChronoMagic, is curated to support metamorphic video generation.
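The Dynamic Frames Extraction step matters because a metamorphic process (a seed sprouting, ice melting) only reads correctly if a training clip spans the full video rather than a short contiguous window. The sketch below shows simple uniform whole-video sampling; the function name and the exact sampling rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def extract_dynamic_frames(total_frames: int, num_samples: int = 16) -> np.ndarray:
    """Return `num_samples` frame indices spread over the whole video.

    Time-lapse videos cover an entire metamorphic process, so the training
    clip is drawn from the full duration instead of a short contiguous
    window. This uses plain uniform spacing; MagicTime's actual sampling
    rule may differ.
    """
    # Evenly spaced positions from the first to the last frame, rounded to
    # integer indices (indices may repeat for very short videos).
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int)


# Example: a 1,200-frame time-lapse reduced to a 16-frame training clip.
print(extract_dynamic_frames(1200, 16))
```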
The model is evaluated with Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and CLIP Similarity (CLIPSIM), and it outperforms other T2V models at generating metamorphic videos. MagicTime is also integrated into a DiT-based architecture to support the Open-Sora-Plan v1.0.0 framework, where it generates high-quality metamorphic landscape time-lapse videos. Extensive experiments and human evaluations show that the generated videos are visually consistent and semantically relevant, and that incorporating physical priors learned from time-lapse videos lets the model produce videos that more accurately reflect real-world physical processes. These results suggest that time-lapse video generation is a promising path toward building metamorphic simulators of the physical world.
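CLIPSIM measures text-video alignment by scoring each generated frame against the prompt with a CLIP model and averaging over frames. A minimal sketch using the Hugging Face transformers CLIP implementation is below; the checkpoint choice and the exact averaging scheme are assumptions, and the paper's evaluation pipeline may differ.

```python
# pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; the paper may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clipsim(prompt: str, frames: list[Image.Image]) -> float:
    """Average cosine similarity between the prompt and each generated frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings, then average frame-prompt similarities.
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```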