Latte: Latent Diffusion Transformer for Video Generation


5 Jan 2024 | Xin Ma¹,², Yaohui Wang²*, Gengyun Jia³, Xinyuan Chen², Ziwei Liu⁴, Yuan-Fang Li¹, Cunjian Chen¹, Yu Qiao²
Latte is a novel latent diffusion transformer for video generation, designed to efficiently model spatio-temporal information in videos. The model first extracts spatio-temporal tokens from input videos and then applies a series of Transformer blocks to model the video distribution in latent space. Four efficient variants of the model are introduced to handle the complexity of video data by decomposing the spatial and temporal dimensions. Through rigorous experimental analysis, the best practices for Latte are determined, covering video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Latte achieves state-of-the-art performance on four standard video generation datasets: FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. Additionally, Latte is extended to text-to-video generation, achieving results comparable to recent T2V models. Comprehensive experiments show that it generates photorealistic videos with temporal coherence, and it provides valuable insights for future research on integrating Transformers into diffusion models for video generation.
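To make the decomposition of spatial and temporal dimensions concrete, the sketch below shows one way an interleaved spatial/temporal Transformer block over latent video tokens could look. This is a minimal illustration, not the actual Latte implementation: the class name SpatialTemporalBlock, the token layout (batch, frames, patches, dim), and the hyperparameters are assumptions for the example.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Illustrative interleaved spatial/temporal Transformer block.

    Tokens are arranged as (batch, frames, patches, dim). Spatial attention
    mixes patch tokens within each frame; temporal attention mixes the same
    patch position across frames.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, p, d = x.shape
        # Spatial attention: attend over patches within each frame.
        xs = x.reshape(b * f, p, d)
        ns = self.norm1(xs)
        xs = xs + self.spatial_attn(ns, ns, ns)[0]
        x = xs.reshape(b, f, p, d)
        # Temporal attention: attend over frames at each patch position.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, f, d)
        nt = self.norm2(xt)
        xt = xt + self.temporal_attn(nt, nt, nt)[0]
        x = xt.reshape(b, p, f, d).permute(0, 2, 1, 3)
        # Token-wise feed-forward network.
        return x + self.mlp(self.norm3(x))

# Example: 2 latent videos, 16 frames, 256 patch tokens per frame, 512-dim tokens.
x = torch.randn(2, 16, 256, 512)
block = SpatialTemporalBlock(512)
print(block(x).shape)  # torch.Size([2, 16, 256, 512])
```

Factorizing attention this way keeps the cost of each attention call quadratic in either the number of patches or the number of frames, rather than in their product, which is the efficiency argument behind decomposing the spatial and temporal dimensions.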