29 Sep 2022 | Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman
Make-A-Video is a text-to-video (T2V) generation model that translates recent progress in text-to-image (T2I) generation directly to video. It learns what the visual world looks like from paired text-image data and how the world moves from unsupervised video data. This design yields three key advantages: (1) faster training, because pre-trained T2I models are reused; (2) no need for paired text-video data; and (3) generation of diverse, high-quality videos. Since the approach does not depend on text-video pairs, it scales to larger video datasets, and training on open-source datasets makes the results easier to reproduce.

Architecturally, Make-A-Video extends a diffusion-based T2I model with novel spatiotemporal modules: the full temporal U-Net and attention tensors are decomposed and approximated in space and time. Pseudo-3D convolution and temporal attention layers let the network reuse the pre-trained T2I weights for the spatial dimensions while learning temporal information fusion from video.
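To make the factorized layers concrete, here is a minimal PyTorch sketch of a pseudo-3D convolution block under the formulation described above: a 2D convolution applied independently to each frame, followed by a 1D convolution across frames that is initialized as the identity, so that before video fine-tuning the block behaves exactly like the pre-trained T2I layer. This is an illustrative sketch, not the authors' implementation; channel counts and kernel sizes are placeholders.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized spatial (2D) + temporal (1D) convolution, in the spirit of
    Make-A-Video's pseudo-3D conv layers. Hyperparameters are assumptions."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        # Initialize the temporal conv as the identity so the block starts out
        # equivalent to the pre-trained 2D (image) layer.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        # Spatial conv applied independently to every frame.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)
        # Temporal conv applied independently at every spatial location.
        x = x.reshape(b, f, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)
        return x  # (batch, channels, frames, height, width)
```

The attention layers are factorized in the same spirit: spatial self-attention within each frame is followed by temporal attention across frames at each spatial location.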
Generation runs through a spatiotemporally factorized diffusion pipeline (a schematic appears in the sketch below): joint text-image priors remove the need for paired text-video data, a frame interpolation network increases the frame rate, and super-resolution strategies in both space and time upscale the output into high-definition, high-frame-rate video. The model is first trained on text-image pairs and then fine-tuned on unlabeled video data, producing videos with coherent motion and faithful text alignment.

Make-A-Video achieves state-of-the-art T2V results, outperforming existing systems in both zero-shot and fine-tuning settings across multiple benchmarks, quantitatively and qualitatively, in both video quality and text-video faithfulness. It also produces more semantically meaningful results than existing methods on video interpolation tasks. Beyond text-to-video, the architecture supports a wide range of applications, including image animation and video variation.
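As a rough picture of the cascade, the following schematic walks from text to high-definition, high-frame-rate video. The stage names and interfaces (prior, decoder, interpolate, sr_low, sr_high) are hypothetical placeholders rather than a released API, and the resolutions in the comments are only indicative.

```python
import torch

# Schematic of the cascaded inference pipeline. The callables passed in are
# hypothetical stand-ins for the stages described in the text above.
@torch.no_grad()
def text_to_video(text: str, prior, decoder, interpolate, sr_low, sr_high) -> torch.Tensor:
    image_emb = prior(text)      # text -> image embedding via the joint text-image prior
    video = decoder(image_emb)   # spatiotemporal decoder: low-res, low-frame-rate clip (e.g. 16 x 64 x 64)
    video = interpolate(video)   # frame interpolation network: increase the frame rate
    video = sr_low(video)        # spatiotemporal super-resolution (e.g. to 256 x 256)
    video = sr_high(video)       # per-frame spatial super-resolution (e.g. to 768 x 768)
    return video                 # tensor of shape (frames, channels, height, width)
```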