2 Jan 2024 | Fuchen Long, Zhaofan Qiu, Ting Yao and Tao Mei
VideoDrafter is a framework for generating content-consistent multi-scene videos. It uses a Large Language Model (LLM) to convert an input prompt into a comprehensive multi-scene script and to identify the key entities that recur across scenes; a reference image is then generated for each entity to anchor its visual appearance. Two diffusion models carry out the generation: VideoDrafter-Img produces a scene-reference image for each scene from its event prompt and the entity reference images, and VideoDrafter-Vid turns that scene-reference image into a video clip, conditioned on the action dynamics and camera movement specified in the script. Evaluated on several benchmarks, the framework demonstrates superior visual quality, content consistency, and user preference compared to state-of-the-art models: the LLM-written script yields a logical, coherent multi-scene structure, while the shared entity reference images keep visual appearance consistent across scenes.
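To make the described pipeline concrete, here is a minimal Python sketch of the stage-by-stage data flow: LLM script, then entity reference images, then a per-scene reference image, then a per-scene video clip. All names here (llm_write_script, videodrafter_img, videodrafter_vid, the Scene/Script fields) are hypothetical placeholders standing in for the paper's components, not a released API; the stubs only illustrate how the stages connect.

```python
from dataclasses import dataclass
from typing import Dict, List

# --- Hypothetical data structures mirroring the paper's script elements ---

@dataclass
class Scene:
    event_prompt: str        # what happens in this scene
    entities: List[str]      # key entities that appear in this scene
    action: str              # action dynamics for the video diffusion model
    camera: str              # camera movement (e.g., "slow pan right")

@dataclass
class Script:
    scenes: List[Scene]
    entity_names: List[str]  # all key entities shared across scenes

# --- Stubbed pipeline stages (placeholders, not the authors' models) ---

def llm_write_script(prompt: str) -> Script:
    """Stage 1: an LLM expands the input prompt into a multi-scene script."""
    scene = Scene(
        event_prompt=f"{prompt} (scene 1)",
        entities=["protagonist"],
        action="walks forward",
        camera="slow pan right",
    )
    return Script(scenes=[scene], entity_names=["protagonist"])

def generate_entity_reference(entity: str) -> str:
    """Stage 2: one reference image per key entity, so every scene
    renders that entity with the same visual appearance."""
    return f"<reference image of {entity}>"

def videodrafter_img(event_prompt: str, entity_refs: List[str]) -> str:
    """Stage 3: diffusion model producing a scene-reference image,
    conditioned on the event prompt and the entity reference images."""
    return f"<scene image for '{event_prompt}' using {len(entity_refs)} refs>"

def videodrafter_vid(scene_image: str, action: str, camera: str) -> str:
    """Stage 4: diffusion model animating the scene-reference image,
    conditioned on action dynamics and camera movement."""
    return f"<clip: {scene_image} | action={action} | camera={camera}>"

def generate_video(prompt: str) -> List[str]:
    """Orchestrate the full flow; clips are concatenated downstream."""
    script = llm_write_script(prompt)
    refs: Dict[str, str] = {
        e: generate_entity_reference(e) for e in script.entity_names
    }
    clips = []
    for scene in script.scenes:
        scene_refs = [refs[e] for e in scene.entities]
        image = videodrafter_img(scene.event_prompt, scene_refs)
        clips.append(videodrafter_vid(image, scene.action, scene.camera))
    return clips

if __name__ == "__main__":
    for clip in generate_video("a fox explores a snowy forest"):
        print(clip)
```

The key design point the sketch captures is that the entity reference images are generated once and reused in every scene that mentions the entity, which is what ties the scenes together visually.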