17 Jan 2024 | Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang
Vlogger is a generic AI system designed to generate minute-level video blogs (vlogs) from user descriptions. Unlike short videos, vlogs often contain complex storylines and diverse scenes, which makes them challenging for existing video generation methods. To address this, Vlogger uses a Large Language Model (LLM) as the Director, which decomposes the vlog generation task into four key stages: Script, Actor, ShowMaker, and Voicer. The Script stage uses a progressive creation paradigm to convert user stories into detailed scripts. The Actor stage generates reference images of the actors for each scene. The ShowMaker stage, a novel video diffusion model, generates a video snippet for each scene and uses textual and visual prompts to enhance spatial-temporal coherence. The Voicer stage adds subtitles to the video snippets. By planning first and then shooting scene by scene, Vlogger overcomes the challenges of long video generation and achieves state-of-the-art performance on zero-shot Text-to-Video (T2V) generation and prediction tasks. It can generate vlogs longer than 5 minutes with coherent scripts and actors, outperforming existing methods such as Phenaki. The system's code and models are available at <http://Vlogger.github.io>.
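The four-stage decomposition described above can be pictured as a simple pipeline. The following is only a minimal sketch of the data flow, not the authors' implementation: every name in it (`Scene`, `Vlog`, `script_stage`, `actor_stage`, `showmaker_stage`, `voicer_stage`, `make_vlog`) is hypothetical, and each stage is a toy stand-in that returns placeholder strings where the real system would call an LLM, an image generator, a video diffusion model, and a voicing model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Scene:
    description: str      # script text for one scene
    actors: List[str]     # actors that appear in this scene


@dataclass
class Vlog:
    snippets: List[str] = field(default_factory=list)   # placeholder "video" snippets
    subtitles: List[str] = field(default_factory=list)  # one subtitle per scene


def script_stage(story: str, actor_names: List[str]) -> List[Scene]:
    """Toy stand-in for the LLM Director: one scene per sentence (assumption)."""
    sentences = [s.strip() for s in story.split(".") if s.strip()]
    return [
        Scene(description=s, actors=[a for a in actor_names if a in s])
        for s in sentences
    ]


def actor_stage(scenes: List[Scene]) -> Dict[str, str]:
    """Create one placeholder reference image per distinct actor."""
    names = {a for scene in scenes for a in scene.actors}
    return {name: f"ref_image[{name}]" for name in sorted(names)}


def showmaker_stage(scene: Scene, refs: Dict[str, str]) -> str:
    """Combine the textual prompt (script) with the visual prompts (actor references)."""
    visual = [refs[a] for a in scene.actors]
    return f"snippet(text={scene.description!r}, refs={visual})"


def voicer_stage(scene: Scene) -> str:
    """Use the scene description as its subtitle (placeholder behaviour)."""
    return scene.description


def make_vlog(story: str, actor_names: List[str]) -> Vlog:
    """Plan first (Script, Actor), then shoot scene by scene (ShowMaker, Voicer)."""
    scenes = script_stage(story, actor_names)
    refs = actor_stage(scenes)
    vlog = Vlog()
    for scene in scenes:
        vlog.snippets.append(showmaker_stage(scene, refs))
        vlog.subtitles.append(voicer_stage(scene))
    return vlog
```

Calling `make_vlog("Teddy wakes up at home. Teddy walks to the beach.", ["Teddy"])` yields one placeholder snippet and one subtitle per sentence. The point of the sketch is that the same actor reference is reused across all scenes, which mirrors how the summary attributes actor coherence to shared visual prompts.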