25 Mar 2024 | Chengxuan Li, Di Huang, Zeyu Lu, Yang Xiao, Qingqi Pei, Lei Bai
This paper presents a comprehensive survey of recent advances in long video generation, organizing them into two key paradigms: divide and conquer and temporal autoregressive. It examines the models commonly used in each paradigm, covering network design and conditioning techniques, and provides an overview and classification of the datasets and evaluation metrics crucial for advancing long video generation research. The paper closes with a summary of existing studies and a discussion of emerging challenges and future directions in this dynamic field.
Long video generation faces several challenges: limited computational resources, the inability of existing models to produce long videos directly, and the scarcity of long video datasets. To address them, two paradigms have been proposed: divide and conquer, which first generates keyframes outlining the storyline and then fills in the intervening frames, and temporal autoregressive, which generates short video segments sequentially, each conditioned on previously generated frames. Both paradigms decompose long video generation into smaller, more tractable sub-tasks.
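To make the two paradigms concrete, the sketch below contrasts their control flow. It is an illustration, not code from any surveyed method: `keyframe_model`, `interpolator`, and `segment_model` are hypothetical callables standing in for whatever short-video generator a given approach plugs in.

```python
from typing import Callable, List, Sequence

Frame = object  # stand-in for an image tensor

def divide_and_conquer(
    prompt: str,
    num_keyframes: int,
    frames_between: int,
    keyframe_model: Callable[[str, int], List[Frame]],
    interpolator: Callable[[Frame, Frame, int], List[Frame]],
) -> List[Frame]:
    """Sparse keyframes fix the storyline first; interpolation fills the gaps."""
    keyframes = keyframe_model(prompt, num_keyframes)
    video: List[Frame] = []
    for start, end in zip(keyframes, keyframes[1:]):
        video.append(start)
        video.extend(interpolator(start, end, frames_between))
    video.append(keyframes[-1])
    return video

def temporal_autoregressive(
    prompt: str,
    num_segments: int,
    segment_model: Callable[[str, Sequence[Frame]], List[Frame]],
    context_frames: int = 8,
) -> List[Frame]:
    """Each new segment is conditioned on the tail of what was generated so far."""
    video: List[Frame] = []
    for _ in range(num_segments):
        context = video[-context_frames:]  # empty for the first segment
        video.extend(segment_model(prompt, context))
    return video
```

The structural trade-off is visible in the loops: divide and conquer commits to the global plot before filling in detail, while temporal autoregressive commits to frames strictly in order, which keeps transitions smooth but lets small errors accumulate over long horizons.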
The paper also discusses the main video generation techniques used to produce long videos with high quality and coherence: diffusion models, autoregressive models, generative adversarial networks (GANs), and mask modeling. It highlights the role of control signals, such as text, image, and video prompts, in guiding the generation process.
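As a concrete example of how a text prompt acts as a control signal, text-conditioned diffusion models commonly apply classifier-free guidance, mixing conditional and unconditional noise predictions at each denoising step. The guidance formula below is standard; the `model(x_t, t, cond)` interface is a hypothetical stand-in for an actual denoiser.

```python
import torch

def cfg_noise_prediction(model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: steer the sample toward the prompt by
    extrapolating away from the unconditional prediction.
    `null_emb` is the embedding of an empty prompt."""
    eps_uncond = model(x_t, t, null_emb)  # prediction with no prompt
    eps_cond = model(x_t, t, text_emb)    # prompt-conditioned prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Raising `guidance_scale` tightens adherence to the prompt at some cost in diversity; image and video prompts are typically injected through analogous conditioning pathways, e.g. by concatenating reference frames or their embeddings.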
The paper further examines challenges specific to long videos: the scarcity of high-quality long video datasets, the complexity of long-range content, and the difficulty of maintaining temporal consistency and continuity. It also analyzes the computational, memory, and data demands of long video generation and surveys resource-saving techniques such as data compression and lightweight model design.
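In practice, data compression here often means generating in a learned latent space rather than in pixel space, as latent diffusion models do. The sketch below is written under that assumption, with `denoiser` and `vae_decode` as hypothetical stand-ins for a pretrained video denoiser and autoencoder decoder.

```python
import torch

@torch.no_grad()
def generate_in_latent_space(denoiser, vae_decode, num_frames,
                             latent_shape=(4, 32, 32), steps=50):
    # A (4, 32, 32) latent holds ~48x fewer values than a (3, 256, 256)
    # RGB frame, so every denoising step costs far less memory and compute.
    z = torch.randn(num_frames, *latent_shape)
    for t in reversed(range(steps)):
        z = denoiser(z, t)   # all iterative refinement stays in latent space
    return vae_decode(z)     # decode to pixels once, at the end
```

Lightweight model design attacks the same cost from the other side, for example by factorizing full spatiotemporal attention into separate spatial and temporal passes, which is far cheaper than attending over every frame-pixel pair jointly.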
The paper concludes with a summary of the current state of long video generation and discusses future directions, including the generation of super-long videos, enhanced controllability, and real-world simulation. It emphasizes the need for further research to address the challenges in long video generation and to advance the field.