This survey provides a comprehensive overview of text-to-video (T2V) generation, focusing on the capabilities and challenges of OpenAI's Sora. Sora marks a significant milestone on the path toward artificial general intelligence: it generates high-quality, minute-long videos with realistic and imaginative scenes from textual instructions, a duration existing T2V studies have not yet matched. Rather than the U-Net backbone common in earlier diffusion systems, Sora builds on the Diffusion Transformer (DiT), which learns efficiently from large-scale internet data and yields high-resolution, seamless output. Nevertheless, it still struggles to keep object motion coherent and to render complex scenes with intricate detail.
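To make the DiT-versus-U-Net contrast concrete, the sketch below shows a minimal DiT-style transformer block operating on a flattened sequence of spacetime patch tokens, with conditioning (e.g., diffusion timestep plus text embedding) injected as a scale-and-shift modulation. This is an illustrative assumption about the general DiT design, not Sora's unpublished architecture; all module names and sizes are hypothetical.

```python
# Illustrative sketch of a DiT-style block on spacetime patch tokens (not Sora's code).
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=512, heads=8, cond_dim=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Conditioning (timestep + text embedding) modulates the normalized tokens.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, tokens, cond):
        # tokens: (batch, num_spacetime_patches, dim); cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm1(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)  # full attention jointly over space and time
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens

# Usage: a 16-frame clip flattened into 16 * 64 patch tokens (hypothetical sizes).
x = torch.randn(2, 16 * 64, 512)
c = torch.randn(2, 512)
print(DiTBlock()(x, c).shape)  # torch.Size([2, 1024, 512])
```

Because the block sees one long token sequence rather than per-frame feature maps, the same attention layers couple spatial and temporal structure, which is what lets transformer-based generators scale more gracefully than frame-wise U-Net designs.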
The survey organizes T2V research along three dimensions, which it terms "evolutionary generators," "excellent pursuit," and "realistic panorama"; reviews widely used datasets and evaluation metrics; identifies open problems; and proposes future research directions. It also covers the foundational models and algorithms on which T2V generation rests, including GANs, VAEs, diffusion models, autoregressive models, and transformers, and traces key advances such as the rise of diffusion-based generators, the integration of transformer backbones, and the use of autoregressive models for efficient video generation. Remaining challenges include maintaining temporal coherence, reaching high output resolution, and ensuring seamless video quality. The survey concludes with a discussion of directions for future research and development in T2V generation.
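For context on the diffusion models that dominate current T2V work, the sketch below shows the standard denoising-diffusion training step applied to a video tensor: noise a clip at a random timestep, then train the network to predict that noise, conditioned on a text embedding. The noise schedule, shapes, and stand-in model are illustrative assumptions, not any specific system from the survey.

```python
# Minimal sketch of an epsilon-prediction diffusion training step for video (assumed setup).
import torch

def diffusion_loss(model, video, text_emb, num_steps=1000):
    # video: (batch, frames, channels, height, width)
    batch = video.shape[0]
    betas = torch.linspace(1e-4, 0.02, num_steps)       # linear noise schedule (assumed)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, num_steps, (batch,))            # random timestep per sample
    a = alphas_bar[t].view(batch, 1, 1, 1, 1)
    noise = torch.randn_like(video)
    noisy = a.sqrt() * video + (1 - a).sqrt() * noise    # forward (noising) process
    pred = model(noisy, t, text_emb)                     # network predicts the injected noise
    return torch.mean((pred - noise) ** 2)

# Usage with a stand-in network; a real T2V model would be a DiT or a 3D U-Net.
dummy = lambda x, t, c: torch.zeros_like(x)
v = torch.randn(2, 8, 3, 32, 32)
c = torch.randn(2, 512)
print(diffusion_loss(dummy, v, c).item())
```

The temporal-coherence challenge noted above arises in exactly this setting: the loss treats the clip as one tensor, so keeping motion consistent across frames depends entirely on how the denoising network couples them.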