17 Jan 2024 | Shaobin Zhang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang
Vlogger is a generic AI system designed to generate minute-level video blogs (vlogs) based on user descriptions. Unlike short videos, vlogs often contain complex storylines and diverse scenes, making them challenging for existing video generation methods. To address this, Vlogger leverages a Large Language Model (LLM) as a director, decomposing the vlog generation task into four key stages: Script, Actor, ShowMaker, and Voicer. The Script stage creates a detailed plan for the vlog, the Actor stage designs characters, the ShowMaker stage generates video snippets with spatial-temporal coherence, and the Voicer stage adds audio narration. ShowMaker, a novel video diffusion model, enhances spatial-temporal coherence by using script and actor descriptions as prompts. Vlogger also employs a mixed training paradigm to improve its T2V generation and prediction capabilities. Extensive experiments show that Vlogger achieves state-of-the-art performance in zero-shot T2V generation and prediction tasks, generating over 5-minute vlogs from open-world descriptions without loss of video coherence. The system is available at https://Vlogger.github.io.
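To make the four-stage design concrete, the sketch below shows one way a Script → Actor → ShowMaker → Voicer flow could be orchestrated in Python. It is an illustrative outline only, not the authors' implementation: every class, function, and parameter name here (SceneScript, llm_write_script, showmaker_generate, voicer_narrate, and so on) is a hypothetical placeholder standing in for the LLM director, the video diffusion model, and the text-to-speech module described in the abstract.

    # Minimal, hypothetical sketch of an LLM-directed vlog pipeline in the
    # spirit of Vlogger's four stages. All names are placeholders.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class SceneScript:
        description: str      # text prompt for one scene
        duration_sec: float   # planned length of the snippet

    def llm_write_script(story: str) -> List[SceneScript]:
        """Script stage: the LLM director splits the user story into scene-level prompts."""
        # Placeholder: a real system would query an LLM here.
        return [SceneScript(f"{story} (scene {i})", 4.0) for i in range(3)]

    def llm_design_actors(scripts: List[SceneScript]) -> dict:
        """Actor stage: derive consistent character reference descriptions."""
        return {"protagonist": "reference portrait description (placeholder)"}

    def showmaker_generate(script: SceneScript, actors: dict,
                           prev_clip: Optional[dict] = None) -> dict:
        """ShowMaker stage: a video diffusion model conditioned on script and actor
        prompts; conditioning on the previous clip stands in for prediction mode,
        which keeps adjacent snippets temporally coherent."""
        return {"frames": f"<video for: {script.description}>",
                "conditioned_on_prev": prev_clip is not None}

    def voicer_narrate(script: SceneScript) -> str:
        """Voicer stage: text-to-speech narration aligned to the scene."""
        return f"<audio narration for: {script.description}>"

    def generate_vlog(story: str):
        scripts = llm_write_script(story)      # 1. Script
        actors = llm_design_actors(scripts)    # 2. Actor
        clips, prev = [], None
        for sc in scripts:
            clip = showmaker_generate(sc, actors, prev_clip=prev)  # 3. ShowMaker
            audio = voicer_narrate(sc)                             # 4. Voicer
            clips.append((clip, audio))
            prev = clip                        # chain snippets for coherence
        return clips

    if __name__ == "__main__":
        for clip, audio in generate_vlog("A day hiking in the Alps"):
            print(clip, audio)

The key design point this sketch tries to capture is the chaining of snippets: each ShowMaker call can condition on the previous clip, which is how a long vlog is assembled scene by scene without breaking visual continuity.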