Latte: Latent Diffusion Transformer for Video Generation

5 Jan 2024 | Xin Ma1,2, Yaohui Wang2*, Gengyun Jia3, Xinyuan Chen2, Ziwei Liu4, Yuan-Fang Li1, Cunjian Chen1, Yu Qiao2
The paper introduces Latte, a novel latent diffusion transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then uses a series of Transformer blocks to model the video distribution in the latent space. To efficiently handle the large number of tokens and the spatial-temporal dimensions of videos, four efficient variants of the Transformer are introduced. The authors conduct a comprehensive ablation study to determine the best practices for Latte, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Experimental results on four standard video generation datasets (FaceForensics, SkyTimelapse, UCF101, and Taichi-HD) demonstrate that Latte achieves state-of-the-art performance. Additionally, Latte is extended to text-to-video generation (T2V) tasks, achieving comparable results to current T2V models. The paper provides valuable insights for future research on integrating Transformers into diffusion models for video generation.
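To make the token-extraction and Transformer-block pipeline concrete, here is a minimal PyTorch sketch of the idea described above: a latent video is split into spatio-temporal patch tokens, which are then processed by alternating spatial and temporal Transformer blocks (in the spirit of one of the efficient variants the paper describes). The class names, tensor shapes, layer sizes, and the omission of positional embeddings are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SpatioTemporalPatchEmbed(nn.Module):
    """Embed each frame of a latent video clip into patch tokens."""

    def __init__(self, in_channels=4, patch_size=2, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, frames, channels, height, width) latent video
        b, f, c, h, w = x.shape
        x = self.proj(x.reshape(b * f, c, h, w))          # (b*f, dim, h', w')
        x = x.flatten(2).transpose(1, 2)                  # (b*f, tokens, dim)
        return x.reshape(b, f, x.shape[1], x.shape[2])    # (b, frames, tokens, dim)


class LatteBlockSketch(nn.Module):
    """One spatial plus one temporal Transformer block, applied in sequence."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True, norm_first=True)
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True, norm_first=True)

    def forward(self, tokens):
        # tokens: (batch, frames, tokens_per_frame, dim)
        b, f, n, d = tokens.shape
        # Spatial attention: tokens within the same frame attend to each other.
        x = self.spatial(tokens.reshape(b * f, n, d)).reshape(b, f, n, d)
        # Temporal attention: the same spatial location attends across frames.
        x = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        x = self.temporal(x).reshape(b, n, f, d).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    video_latents = torch.randn(2, 16, 4, 32, 32)          # two 16-frame latent clips
    tokens = SpatioTemporalPatchEmbed()(video_latents)     # (2, 16, 256, 768)
    out = LatteBlockSketch()(tokens)
    print(out.shape)                                       # torch.Size([2, 16, 256, 768])
```

In the full model, a stack of such blocks (with diffusion timestep and class/text conditioning injected into each block) would predict the denoising target in latent space; the sketch only illustrates how spatial and temporal attention can be factorized to keep the token count per attention call manageable.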