29 Sep 2022 | Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman
Make-A-Video is an approach that leverages recent advances in Text-to-Image (T2I) generation to achieve Text-to-Video (T2V) generation without paired text-video data. The method learns the correspondence between text and the visual world from text-image pairs, and learns realistic motion from unlabeled video through unsupervised learning. This design has three key advantages: it accelerates T2V training because visual and multimodal representations do not have to be learned from scratch, it does not require paired text-video data, and the generated videos inherit the diversity and quality of today's T2I models.
The Make-A-Video architecture consists of three main components:
1. **Text-to-Image Model**: A pre-trained T2I model that generates image embeddings from text.
2. **Spatiotemporal Layers**: Pseudo-3D convolutional and attention layers that factorize each pre-trained spatial layer into a spatial operation followed by a temporal one, extending the T2I model to generate temporally coherent video (see the sketch after this list).
3. **Spatiotemporal Networks**: Super-resolution networks and a frame interpolation network that increase the spatial resolution and frame rate of the generated clip, producing high-definition, high frame rate videos.
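The pseudo-3D convolution can be pictured as a standard 2D convolution applied to each frame independently, followed by a 1D convolution applied across frames at each spatial location, with the temporal part initialized as the identity so the pre-trained T2I weights are preserved at the start of training. Below is a minimal PyTorch sketch of this idea; the class name, tensor layout, and hyperparameters are illustrative, not taken from the released model.

```python
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """Factorized "pseudo-3D" convolution: 2D spatial conv, then 1D temporal conv.

    The temporal convolution is initialized as the identity, so right after
    initialization the layer behaves exactly like the pre-trained 2D layer.
    """

    def __init__(self, channels: int, spatial_kernel: int = 3, temporal_kernel: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, spatial_kernel,
                                 padding=spatial_kernel // 2)
        self.temporal = nn.Conv1d(channels, channels, temporal_kernel,
                                  padding=temporal_kernel // 2)
        # Identity initialization: Dirac kernel plus zero bias keeps the
        # pre-trained spatial behaviour untouched at the start of training.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Spatial convolution over each frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        # Temporal convolution over each spatial location independently.
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x
```

The same factorization is applied to attention: spatial self-attention within each frame is followed by temporal attention across frames, again initialized so that it starts out as an identity mapping.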
Key contributions of Make-A-Video include:
- Extending a diffusion-based T2I model to T2V through spatiotemporally factorized diffusion.
- Leveraging joint text-image priors to bypass the need for paired text-video data.
- Implementing super-resolution strategies in space and time to generate high-definition, high frame rate videos (the full inference cascade is sketched below).
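At inference time, the paper chains several networks: a prior that maps text to an image embedding, a spatiotemporal decoder that produces a short low-resolution clip, a frame interpolation network that raises the frame rate, and super-resolution networks that raise the spatial resolution. The sketch below illustrates this cascade; the module names, interfaces, and tensor shapes are hypothetical stand-ins, since the actual components are not public.

```python
import torch


@torch.no_grad()
def make_a_video(text, prior, decoder, interpolate, sr_low, sr_high):
    """Text -> image embedding -> low-res clip -> more frames -> higher resolution.

    All arguments are hypothetical callables standing in for the paper's
    prior, spatiotemporal decoder, frame interpolation, and SR networks.
    """
    # 1. Prior network maps the text to a CLIP-style image embedding.
    image_embedding = prior(text)

    # 2. Spatiotemporal decoder generates a short, low-resolution clip
    #    conditioned on the image embedding (e.g. 16 frames at 64x64).
    frames = decoder(image_embedding)   # (16, 3, 64, 64)

    # 3. Frame interpolation network increases the effective frame rate
    #    by generating new frames between the decoded ones (e.g. to 76 frames).
    frames = interpolate(frames)        # (76, 3, 64, 64)

    # 4. Spatiotemporal then per-frame spatial super-resolution increase
    #    the resolution of every frame (e.g. 64 -> 256 -> 768 pixels).
    frames = sr_low(frames)             # (76, 3, 256, 256)
    frames = sr_high(frames)            # (76, 3, 768, 768)
    return frames
```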
The method is evaluated using various datasets and metrics, demonstrating state-of-the-art performance in both quantitative and qualitative measures. Human evaluation further confirms the superior quality and text-video faithfulness of Make-A-Video's generated videos compared to existing T2V systems.