29 Sep 2022 | Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman
Make-A-Video is an approach that leverages recent advances in Text-to-Image (T2I) generation to achieve Text-to-Video (T2V) generation without paired text-video data. The method learns the correspondence between text and the visual world from text-image pairs, and learns realistic motion from unlabeled video through unsupervised learning. This design has three key advantages: it accelerates T2V training because visual and multimodal representations do not have to be learned from scratch, it does not require paired text-video data, and the generated videos inherit the diversity and quality of today's T2I models.
The Make-A-Video architecture consists of three main components:
1. **Text-to-Image Model**: A pre-trained T2I model that generates image embeddings from text.
2. **Spatiotemporal Layers**: Pseudo-3D convolutional and attention layers that factorize each pre-trained spatial layer into a spatial operation followed by a temporal one, extending the T2I model to generate temporally coherent video (see the sketch after this list).
3. **Spatiotemporal Networks**: Super-resolution networks and a frame interpolation network that increase the spatial resolution and frame rate of the generated clip, producing high-definition, high frame rate videos.
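The pseudo-3D convolution can be pictured as a standard 2D convolution applied to each frame independently, followed by a 1D convolution applied across frames at each spatial location, with the temporal part initialized as the identity so the pre-trained T2I weights are preserved at the start of training. Below is a minimal PyTorch sketch of this idea; the class name, tensor layout, and hyperparameters are illustrative, not taken from the released model.

```python
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """Factorized "pseudo-3D" convolution: 2D spatial conv, then 1D temporal conv.

    The temporal convolution is initialized as the identity, so right after
    initialization the layer behaves exactly like the pre-trained 2D layer.
    """

    def __init__(self, channels: int, spatial_kernel: int = 3, temporal_kernel: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, spatial_kernel,
                                 padding=spatial_kernel // 2)
        self.temporal = nn.Conv1d(channels, channels, temporal_kernel,
                                  padding=temporal_kernel // 2)
        # Identity initialization: Dirac kernel plus zero bias keeps the
        # pre-trained spatial behaviour untouched at the start of training.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Spatial convolution over each frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)
        # Temporal convolution over each spatial location independently.
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x
```

The same factorization is applied to attention: spatial self-attention within each frame is followed by temporal attention across frames, again initialized so that it starts out as an identity mapping.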
Key contributions of Make-A-Video include:
- Extending a diffusion-based T2I model to T2V through spatiotemporally factorized diffusion.
- Leveraging joint text-image priors to bypass the need for paired text-video data.
- Implementing super-resolution strategies in space and time to generate high-definition, high frame rate videos (the full inference cascade is sketched below).
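At inference time, the paper chains several networks: a prior that maps text to an image embedding, a spatiotemporal decoder that produces a short low-resolution clip, a frame interpolation network that raises the frame rate, and super-resolution networks that raise the spatial resolution. The sketch below illustrates this cascade; the module names, interfaces, and tensor shapes are hypothetical stand-ins, since the actual components are not public.

```python
import torch


@torch.no_grad()
def make_a_video(text, prior, decoder, interpolate, sr_low, sr_high):
    """Text -> image embedding -> low-res clip -> more frames -> higher resolution.

    All arguments are hypothetical callables standing in for the paper's
    prior, spatiotemporal decoder, frame interpolation, and SR networks.
    """
    # 1. Prior network maps the text to a CLIP-style image embedding.
    image_embedding = prior(text)

    # 2. Spatiotemporal decoder generates a short, low-resolution clip
    #    conditioned on the image embedding (e.g. 16 frames at 64x64).
    frames = decoder(image_embedding)   # (16, 3, 64, 64)

    # 3. Frame interpolation network increases the effective frame rate
    #    by generating new frames between the decoded ones (e.g. to 76 frames).
    frames = interpolate(frames)        # (76, 3, 64, 64)

    # 4. Spatiotemporal then per-frame spatial super-resolution increase
    #    the resolution of every frame (e.g. 64 -> 256 -> 768 pixels).
    frames = sr_low(frames)             # (76, 3, 256, 256)
    frames = sr_high(frames)            # (76, 3, 768, 768)
    return frames
```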
The method is evaluated using various datasets and metrics, demonstrating state-of-the-art performance in both quantitative and qualitative measures. Human evaluation further confirms the superior quality and text-video faithfulness of Make-A-Video's generated videos compared to existing T2V systems.