3 Jan 2024 | Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard
This paper presents a framework for generating photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, the framework outputs multiple plausible gestural motions for an individual, including face, body, and hands. The key to the method is combining the sample diversity of vector quantization with the high-frequency detail of diffusion to generate more dynamic and expressive motion. The framework uses a first-of-its-kind multi-view conversational dataset to enable photorealistic reconstruction. Experiments show the model generates appropriate and diverse gestures, outperforming diffusion-only and VQ-only methods. A perceptual evaluation highlights the importance of photorealism for accurately assessing subtle motion details in conversational gestures. The code and dataset are publicly available.
The framework generates photorealistic avatars conditioned on the speech audio of a dyadic conversation. It synthesizes diverse high-frequency gestures and expressive facial movements synchronized with speech. For the body and hands, it leverages both an autoregressive VQ-based method and a diffusion model. The VQ transformer takes conversational audio as input and outputs a sequence of guide poses at a reduced frame rate, allowing diverse poses while avoiding drift. The audio and guide poses are then passed to the diffusion model, which infills intricate motion details at a higher frame rate. For the face, an audio-conditioned diffusion model is used. The predicted face, body, and hand motion is then rendered with a photorealistic avatar.
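To make the two-stage body/hand design concrete, below is a minimal PyTorch sketch of the idea: a VQ-based module samples coarse guide poses from audio at a low frame rate, and a diffusion denoiser conditioned on the audio and the upsampled guide poses infills motion at the output frame rate. All module names, tensor shapes, frame rates, and architectural details here are illustrative assumptions, not the authors' released implementation; the sketch also omits autoregressive decoding, the face branch, training losses, and rendering.

```python
# Illustrative sketch only -- hypothetical module names and shapes,
# not the paper's released code.
import torch
import torch.nn as nn


class GuidePoseVQTransformer(nn.Module):
    """Stand-in for the audio-conditioned VQ transformer.

    Maps audio features to codebook indices and returns coarse guide poses
    at a reduced frame rate. Sampling from the categorical distribution
    (rather than taking the argmax) is what yields diverse guide poses.
    """

    def __init__(self, audio_dim=128, codebook_size=256, pose_dim=104, d_model=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, pose_dim)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.to_logits = nn.Linear(d_model, codebook_size)

    def forward(self, audio_feats):                      # (B, T_low, audio_dim)
        h = self.backbone(self.audio_proj(audio_feats))
        logits = self.to_logits(h)                       # (B, T_low, codebook_size)
        idx = torch.distributions.Categorical(logits=logits).sample()
        return self.codebook(idx)                        # (B, T_low, pose_dim)


class MotionDiffusionInfiller(nn.Module):
    """One denoising step of a pose diffusion model, conditioned on audio
    and on guide poses upsampled to the output frame rate (sketch only)."""

    def __init__(self, pose_dim=104, audio_dim=128, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim * 2 + audio_dim + 1, d_model),
            nn.SiLU(),
            nn.Linear(d_model, pose_dim),
        )

    def forward(self, noisy_pose, guide_pose, audio_feats, t):
        # Broadcast the diffusion timestep as one extra channel per frame.
        t_feat = t.view(-1, 1, 1).expand(noisy_pose.shape[0], noisy_pose.shape[1], 1)
        x = torch.cat([noisy_pose, guide_pose, audio_feats, t_feat], dim=-1)
        return self.net(x)  # predicted noise (or clean pose), depending on parameterization


# Toy forward pass: 4 s of audio, guide poses at ~1 fps infilled to 30 fps.
B, T_low, T_high = 1, 4, 120
audio_low = torch.randn(B, T_low, 128)
audio_high = torch.randn(B, T_high, 128)

guide = GuidePoseVQTransformer()(audio_low)              # (1, 4, 104)
guide_up = nn.functional.interpolate(                    # nearest-neighbor upsample to 30 fps
    guide.transpose(1, 2), size=T_high).transpose(1, 2)
noisy = torch.randn(B, T_high, 104)
t = torch.rand(B)
eps_hat = MotionDiffusionInfiller()(noisy, guide_up, audio_high, t)
print(eps_hat.shape)  # torch.Size([1, 120, 104])
```

The design intent the sketch tries to capture is the split of responsibilities: the discrete VQ stage provides diverse, drift-free keyframes, while the continuous diffusion stage adds the high-frequency detail between them.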
The framework introduces a rich dataset of dyadic interactions captured in a multi-view system, enabling accurate body and face tracking and photorealistic 3D reconstructions. It consists of non-scripted, long-form conversations covering a wide range of topics and emotions, together with photorealistic renders of each participant, capturing the dynamics of interpersonal conversation rather than individual monologues. The dataset and renderer are publicly available.
The framework's method generates more realistic and diverse motion than prior work. The perceptual evaluation also calls into question the validity of evaluating conversational motion on non-textured meshes. In addition, the framework introduces a novel dataset of long-form conversations that enables photorealistic conversational avatars. The code, dataset, and renderers will all be publicly available.