VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

13 Mar 2024 | Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros, Thiemo Alldieck, Cristian Sminchisescu
VLOGGER is a framework for audio-driven human video synthesis from a single input image. Given one image and an audio signal, it generates photorealistic and temporally coherent videos of a person talking and moving, including head motion, gaze, blinking, lip movement, and upper-body and hand gestures. The method builds on generative diffusion models and combines a stochastic human-to-3D-motion diffusion model with a novel diffusion-based architecture that augments text-to-image models with spatial and temporal controls. This enables high-quality video generation of variable length, easily controllable through high-level representations of human faces and bodies. VLOGGER requires no per-person training, does not rely on face detection and cropping, and generates the complete image rather than just the face or lips. It also covers a broad spectrum of scenarios, such as a visible torso and diverse subject identities, which are critical for accurate human synthesis.

The method follows a two-step approach. First, a generative diffusion-based network predicts body motion and facial expressions from the input audio signal. Second, a novel architecture based on recent image diffusion models provides control in the temporal and spatial domains. By relying on generative human priors, the combined architecture improves the capacity of image diffusion models, which often struggle to generate consistent human images. VLOGGER consists of a base model followed by a super-resolution diffusion model to obtain high-quality videos; generation is conditioned on 2D controls that represent the full body, including facial expressions, body, and hands. Videos of arbitrary length are produced through a temporal outpainting approach, and the same machinery supports editing particular parts of an input video, such as the lips or the face region, as well as personalization.

VLOGGER is trained and evaluated on MENTOR, a new, diverse, large-scale dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones, spanning diverse subjects, viewpoints, speech, and degrees of body visibility. The dataset also contains videos with dynamic hand gestures, which are important for learning the complexity of human communication. VLOGGER outperforms state-of-the-art methods on three public benchmarks in image quality, identity preservation, and temporal consistency, while also generating upper-body gestures; it achieves state-of-the-art image quality and diversity results on the HDTF and TalkingHead-1KH datasets and is more expressive and robust across different diversity axes. Overall, the method is evaluated on several metrics, including image quality, lip sync, temporal consistency, and identity preservation, and is further demonstrated on video editing and personalization applications.
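To make the two-stage design above concrete, here is a minimal, hedged sketch of the data flow from a single image and an audio track to a high-resolution video. It is not the authors' implementation: every function name, tensor shape, and parameter count below is an illustrative assumption, and each stage is stubbed with placeholder computation rather than an actual diffusion model.

```python
# Minimal structural sketch of a VLOGGER-style two-stage pipeline (not the
# authors' code). All names, shapes, and dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def audio_to_motion(audio_features: np.ndarray) -> np.ndarray:
    """Stage 1 (hypothetical): a stochastic motion-diffusion model mapping
    per-frame audio features to 3D body/face parameters (pose + expression).
    Here we just return random parameters with a plausible shape."""
    num_frames = audio_features.shape[0]
    motion_dim = 3 + 72 + 100          # translation + body pose + expression (assumed sizes)
    return rng.standard_normal((num_frames, motion_dim))

def render_2d_controls(motion_params: np.ndarray, height=256, width=256) -> np.ndarray:
    """Rasterize the predicted 3D motion into dense 2D control maps; the paper
    conditions video generation on such full-body controls."""
    num_frames = motion_params.shape[0]
    return rng.random((num_frames, height, width, 3))

def video_diffusion(reference_image: np.ndarray, controls: np.ndarray) -> np.ndarray:
    """Stage 2 (hypothetical): a temporally-conditioned image-diffusion model
    generating one frame per control map while preserving the identity in the
    single reference image. A real model would run iterative denoising; here
    the reference is simply repeated as a placeholder."""
    num_frames = controls.shape[0]
    return np.repeat(reference_image[None], num_frames, axis=0)

def super_resolve(frames: np.ndarray, scale=2) -> np.ndarray:
    """Cascaded super-resolution stage; stubbed as nearest-neighbour upsampling."""
    return frames.repeat(scale, axis=1).repeat(scale, axis=2)

# End-to-end flow: single image + audio -> motion -> controls -> low-res video -> high-res video.
reference = rng.random((256, 256, 3))            # the single input image
audio = rng.standard_normal((75, 1024))          # 75 frames of audio features (assumed)
motion = audio_to_motion(audio)
controls = render_2d_controls(motion)
video = super_resolve(video_diffusion(reference, controls))
print(video.shape)                               # (75, 512, 512, 3)
```

The point is only the interface: the audio-to-motion stage produces per-frame 3D parameters, which are rasterized into dense 2D controls that condition the temporal image-diffusion stage and its super-resolution cascade.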
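The summary also mentions generating videos of arbitrary length through temporal outpainting. The sketch below shows one generic way such a scheme can work, under the assumption that each new chunk of frames is denoised while a few frames inherited from the previous chunk are held fixed; the chunk size, overlap, and denoiser interface are invented for illustration and are not taken from the paper.

```python
# Hedged sketch of temporal outpainting for arbitrary-length generation.
# Chunk size, overlap, and the denoiser signature are assumptions.
import numpy as np

rng = np.random.default_rng(0)
CHUNK, OVERLAP = 16, 4   # frames denoised per pass / frames held fixed (assumed values)

def denoise_chunk(controls, known_prefix=None):
    """Placeholder for one diffusion pass over up to CHUNK frames.
    When `known_prefix` is given, those frames are clamped in place, which is
    what ties a new chunk to the end of the previously generated one."""
    frames = rng.random((controls.shape[0], 64, 64, 3))
    if known_prefix is not None:
        frames[: known_prefix.shape[0]] = known_prefix
    return frames

def generate_long_video(controls):
    """Outpaint the video chunk by chunk, reusing the last OVERLAP generated
    frames as the fixed prefix of the next chunk."""
    total = len(controls)
    frames = denoise_chunk(controls[:CHUNK])
    out, pos = [frames], min(CHUNK, total)
    while pos < total:
        window = controls[pos - OVERLAP : pos - OVERLAP + CHUNK]
        frames = denoise_chunk(window, known_prefix=frames[-OVERLAP:])
        out.append(frames[OVERLAP:])          # keep only the newly generated frames
        pos += CHUNK - OVERLAP
    return np.concatenate(out, axis=0)[:total]

controls = rng.random((40, 64, 64, 3))        # 40 per-frame control maps (illustrative)
video = generate_long_video(controls)
print(video.shape)                            # -> (40, 64, 64, 3)
```

With a sliding-window scheme of this kind, the clip can be extended indefinitely while each chunk stays consistent with the frames it overlaps.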