From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

3 Jan 2024 | Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard
This paper presents a framework for generating photorealistic avatars that gesture according to the conversational dynamics of dyadic interactions. Given speech audio, the method outputs multiple plausible gestural motions for an individual, spanning face, body, and hands. The key innovation is combining the sample diversity of vector quantization with the high-frequency detail obtained through diffusion, yielding more dynamic and expressive motion. The generated motion is rendered on highly photorealistic avatars that can express crucial nuances in gestures, such as sneers and smirks. To facilitate this research, the authors introduce a multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show that the model generates appropriate and diverse gestures, outperforming both diffusion-only and VQ-only methods. Perceptual evaluations highlight the importance of photorealism in accurately assessing subtle motion details in conversational gestures. The code and dataset are available on the project page.
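
To make the two-stage idea concrete, here is a minimal sketch of how a VQ prior and a diffusion model might be composed at sampling time: a codebook-based prior supplies diverse coarse guide poses, and a conditional denoising loop fills in high-frequency motion around them. All names, shapes, and the stub denoiser below are illustrative assumptions for exposition, not the authors' actual API or trained models.

```python
# Hypothetical sketch: VQ prior for coarse guide poses + diffusion infilling.
# Module names, dimensions, and the denoiser are placeholders, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: coarse guide poses from a VQ codebook prior -------------------
CODEBOOK = rng.normal(size=(512, 104))        # 512 codes, 104-D pose vectors (assumed dims)

def sample_guide_poses(audio_feat: np.ndarray, num_steps: int) -> np.ndarray:
    """Pick codebook entries conditioned on audio.

    A real model would run an autoregressive transformer over audio features;
    here we draw random code indices just to show the data flow.
    """
    idx = rng.integers(0, len(CODEBOOK), size=num_steps)
    return CODEBOOK[idx]                       # (num_steps, 104) coarse poses

# --- Stage 2: diffusion infilling at the full frame rate --------------------
def denoise_step(x_t, t, audio_feat, guide):
    """One reverse-diffusion step; a stand-in for a trained denoiser."""
    pred_x0 = 0.9 * x_t + 0.1 * guide          # pull the sample toward the guide poses
    noise = rng.normal(size=x_t.shape)
    return pred_x0 + 0.01 * t * noise          # re-inject noise, scaled down over time

def sample_motion(audio_feat, guide_poses, upsample=4, T=50):
    # Repeat the low-rate guide poses up to the output frame rate.
    guide = np.repeat(guide_poses, upsample, axis=0)
    x = rng.normal(size=guide.shape)           # start from pure noise
    for t in range(T, 0, -1):                  # reverse-diffusion loop
        x = denoise_step(x, t / T, audio_feat, guide)
    return x                                   # (frames, 104) full-rate motion

audio_feat = rng.normal(size=(240, 128))       # placeholder audio features
guides = sample_guide_poses(audio_feat, num_steps=60)
motion = sample_motion(audio_feat, guides)
print(motion.shape)                            # (240, 104)
```

Under this reading, diversity comes from which discrete codes the prior selects, while the diffusion stage contributes the fine, high-frequency detail that discrete codes alone tend to wash out.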