2 Jan 2024 | Fuchen Long, Zhaofan Qiu, Ting Yao and Tao Mei
VideoDrafter is a framework for generating content-consistent multi-scene videos. It uses a Large Language Model (LLM) to convert an input prompt into a comprehensive multi-scene script and to identify the key entities that recur across scenes; a reference image is then generated for each entity to anchor its visual appearance. Two diffusion models carry out the generation: VideoDrafter-Img produces a scene-reference image for each scene from its event prompt and the entity reference images, and VideoDrafter-Vid turns that scene-reference image into a video clip, conditioned on the action dynamics and camera movement specified in the script. Evaluated on several benchmarks, the framework demonstrates superior visual quality, content consistency, and user preference compared to state-of-the-art models: the LLM-written script yields a logical, coherent multi-scene structure, while the shared entity reference images keep visual appearance consistent across scenes.
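To make the described pipeline concrete, here is a minimal Python sketch of the stage-by-stage data flow: LLM script, then entity reference images, then a per-scene reference image, then a per-scene video clip. All names here (llm_write_script, videodrafter_img, videodrafter_vid, the Scene/Script fields) are hypothetical placeholders standing in for the paper's components, not a released API; the stubs only illustrate how the stages connect.

```python
from dataclasses import dataclass
from typing import Dict, List

# --- Hypothetical data structures mirroring the paper's script elements ---

@dataclass
class Scene:
    event_prompt: str        # what happens in this scene
    entities: List[str]      # key entities that appear in this scene
    action: str              # action dynamics for the video diffusion model
    camera: str              # camera movement (e.g., "slow pan right")

@dataclass
class Script:
    scenes: List[Scene]
    entity_names: List[str]  # all key entities shared across scenes

# --- Stubbed pipeline stages (placeholders, not the authors' models) ---

def llm_write_script(prompt: str) -> Script:
    """Stage 1: an LLM expands the input prompt into a multi-scene script."""
    scene = Scene(
        event_prompt=f"{prompt} (scene 1)",
        entities=["protagonist"],
        action="walks forward",
        camera="slow pan right",
    )
    return Script(scenes=[scene], entity_names=["protagonist"])

def generate_entity_reference(entity: str) -> str:
    """Stage 2: one reference image per key entity, so every scene
    renders that entity with the same visual appearance."""
    return f"<reference image of {entity}>"

def videodrafter_img(event_prompt: str, entity_refs: List[str]) -> str:
    """Stage 3: diffusion model producing a scene-reference image,
    conditioned on the event prompt and the entity reference images."""
    return f"<scene image for '{event_prompt}' using {len(entity_refs)} refs>"

def videodrafter_vid(scene_image: str, action: str, camera: str) -> str:
    """Stage 4: diffusion model animating the scene-reference image,
    conditioned on action dynamics and camera movement."""
    return f"<clip: {scene_image} | action={action} | camera={camera}>"

def generate_video(prompt: str) -> List[str]:
    """Orchestrate the full flow; clips are concatenated downstream."""
    script = llm_write_script(prompt)
    refs: Dict[str, str] = {
        e: generate_entity_reference(e) for e in script.entity_names
    }
    clips = []
    for scene in script.scenes:
        scene_refs = [refs[e] for e in scene.entities]
        image = videodrafter_img(scene.event_prompt, scene_refs)
        clips.append(videodrafter_vid(image, scene.action, scene.camera))
    return clips

if __name__ == "__main__":
    for clip in generate_video("a fox explores a snowy forest"):
        print(clip)
```

The key design point the sketch captures is that the entity reference images are generated once and reused in every scene that mentions the entity, which is what ties the scenes together visually.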