From Sora What We Can See: A Survey of Text-to-Video Generation

17 May 2024 | Rui Sun*, Yumin Zhang†‡, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan, Fellow, IEEE
This paper provides a comprehensive survey of text-to-video (T2V) generation, focusing on the advancements and challenges highlighted by OpenAI's Sora. Sora, capable of generating minute-long, high-quality videos from textual instructions, represents a significant milestone in the field. The survey categorizes the literature along three dimensions: evolutionary generators, excellent pursuit, and realistic panorama. It also details widely used datasets and evaluation metrics, identifies key challenges, and proposes future research directions.

The foundational models and algorithms, such as GANs, VAEs, diffusion models, autoregressive models, and transformers, are discussed in depth. The paper traces the evolution of T2V models, from early VAE- and GAN-based approaches to more advanced diffusion and autoregressive models. It also explores the pursuit of extended video duration, superior resolution, and seamless quality, as well as the challenges of generating realistic video outputs with dynamic motion, complex scenes, multiple objects, and rational layouts. The survey concludes with a synthesis of insights and implications, emphasizing the need for further research to address the remaining challenges in T2V generation.
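To make the dominant diffusion-based paradigm concrete, the sketch below shows the general shape of a text-conditioned latent-diffusion sampling loop of the kind recent T2V generators build on. It is a minimal illustration under stated assumptions, not code from the survey or from Sora: the denoiser, text embedding, noise schedule, and tensor shapes are all hypothetical placeholders.

```python
# Illustrative sketch of DDPM-style sampling over a video latent, conditioned
# on a text embedding. Every name and hyperparameter here is a placeholder;
# a real system would plug in a trained spatio-temporal U-Net or diffusion
# transformer and a learned schedule.
import torch

T, C, F, H, W = 50, 4, 8, 32, 32   # diffusion steps; latent channels, frames, height, width

# Stand-in for a frozen text encoder's output (e.g. token embeddings of the prompt).
text_emb = torch.randn(1, 77, 512)

# Placeholder denoiser: predicts the noise present in z_t given timestep and text.
denoiser = lambda z_t, t, cond: torch.zeros_like(z_t)

# Linear beta schedule and cumulative alpha terms used by DDPM-style samplers.
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

z = torch.randn(1, C, F, H, W)      # start from pure Gaussian noise in latent space
for t in reversed(range(T)):
    eps = denoiser(z, t, text_emb)  # predicted noise at step t
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    mean = (z - coef * eps) / torch.sqrt(alphas[t])
    noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
    z = mean + torch.sqrt(betas[t]) * noise

# The final latent z would then be decoded to RGB frames by a video VAE decoder.
print(z.shape)  # torch.Size([1, 4, 8, 32, 32])
```

Autoregressive T2V models covered by the survey follow a different recipe (predicting discrete video tokens one step at a time), but the conditioning-on-text structure is analogous.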