The paper introduces Latte, a novel latent diffusion transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then uses a series of Transformer blocks to model the video distribution in the latent space. To handle the large number of tokens arising from the spatial and temporal dimensions of videos, four efficient Transformer variants are introduced. The authors conduct a comprehensive ablation study to determine best practices for Latte, covering video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Experimental results on four standard video generation datasets (FaceForensics, SkyTimelapse, UCF101, and Taichi-HD) demonstrate that Latte achieves state-of-the-art performance. Additionally, Latte is extended to text-to-video generation (T2V), achieving results comparable to current T2V models. The paper provides valuable insights for future research on integrating Transformers into diffusion models for video generation.
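To make the token-decomposition idea concrete, below is a minimal PyTorch sketch of the interleaved spatial/temporal block design used by one of the variants: spatial attention mixes patch tokens within each frame, temporal attention mixes tokens at the same spatial location across frames. The class name `SpatioTemporalBlock`, the use of `nn.TransformerEncoderLayer`, and the tensor sizes are illustrative assumptions, not the authors' released implementation; the diffusion components (VAE encoding, timestep conditioning, noise prediction) are omitted.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One interleaved pair of spatial and temporal Transformer blocks.

    Input x has shape (B, F, N, D): batch, frames, patch tokens per frame, dim.
    """
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True)
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, F, N, D = x.shape
        # Spatial block: attend over the N patch tokens within each frame.
        x = self.spatial(x.reshape(B * F, N, D)).reshape(B, F, N, D)
        # Temporal block: attend over the F frames at each spatial location.
        x = x.permute(0, 2, 1, 3).reshape(B * N, F, D)
        x = self.temporal(x).reshape(B, N, F, D).permute(0, 2, 1, 3)
        return x

# Toy usage: 2 videos, 8 latent frames, 16x16 latent patches -> 256 tokens per frame.
tokens = torch.randn(2, 8, 256, 384)
block = SpatioTemporalBlock(dim=384, heads=6)
print(block(tokens).shape)  # torch.Size([2, 8, 256, 384])
```

The point of the decomposition is that attention cost scales with the square of the sequence length, so splitting full spatio-temporal attention (F·N tokens) into separate spatial (N tokens) and temporal (F tokens) passes keeps the computation tractable for video-length token sequences.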