13 Mar 2024 | Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros, Thiemo Alldieck, Cristian Sminchisescu
VLOGGER is a novel framework for audio-driven human video generation from a single input image. It generates photorealistic and temporally coherent videos of a person talking and moving, including head motion, gaze, blinking, lip movement, and upper-body and hand gestures. The method consists of two main components: a stochastic human-to-3D-motion diffusion model and a novel diffusion-based architecture that enhances text-to-image models with spatial and temporal controls. This approach supports the generation of high-quality, variable-length videos with controllable human faces and bodies.

VLOGGER does not require training for each individual, does not rely on face detection, and considers a broad spectrum of scenarios, making it more versatile than previous methods. The authors also introduce MENTOR, a large-scale dataset with 800,000 identities and dynamic gestures, which is used to train and evaluate VLOGGER. VLOGGER outperforms state-of-the-art methods on three public benchmarks, demonstrating superior image quality, identity preservation, and temporal consistency. The method is further evaluated on diversity metrics, showing low bias and outperforming baselines across various perceived human attributes. Applications in video editing and personalization are also discussed.
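The two-stage design described above — a stochastic audio-to-motion diffusion model followed by a video diffusion model conditioned on the predicted 3D controls and the reference image — can be sketched as follows. This is a minimal illustrative sketch only, not the authors' implementation: all function names, the pose dimensionality, and the placeholder "frames" are assumptions introduced here for clarity.

```python
# Illustrative two-stage sketch of an audio-driven video pipeline.
# Hypothetical names and shapes; NOT the VLOGGER implementation.

import random

def motion_diffusion(audio_features, num_frames, seed=None):
    """Stage 1 (sketch): map audio features to per-frame 3D motion
    controls (e.g. head pose, expression, gesture parameters).
    Stochastic: different seeds yield different plausible motions
    for the same audio. Here we just sample placeholder vectors."""
    rng = random.Random(seed)
    pose_dim = 10  # assumed size of a per-frame pose/expression vector
    return [[rng.gauss(0.0, 1.0) for _ in range(pose_dim)]
            for _ in range(num_frames)]

def video_diffusion(reference_image, motion_controls):
    """Stage 2 (sketch): a temporally-aware diffusion model conditioned
    on the single reference image and the per-frame 3D controls.
    Here each 'frame' is a placeholder record, not pixels."""
    return [{"frame_index": i,
             "identity": reference_image["id"],
             "controls": c}
            for i, c in enumerate(motion_controls)]

def generate_video(reference_image, audio_features, num_frames, seed=None):
    """Full pipeline: audio -> motion controls -> controlled video."""
    controls = motion_diffusion(audio_features, num_frames, seed)
    return video_diffusion(reference_image, controls)

# Usage: one reference image, a few audio features, a 4-frame clip.
video = generate_video({"id": "person_0"},
                       audio_features=[0.1, 0.2],
                       num_frames=4, seed=42)
```

The key property the sketch mirrors is that identity comes only from the reference image, while motion comes only from the audio-conditioned stochastic stage, so the same person can be animated with many different plausible motions.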