Towards Variable and Coordinated Holistic Co-Speech Motion Generation

15 Apr 2024 | Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, Changxing Ding
This paper presents ProbTalk, a unified probabilistic framework for generating lifelike, variable, and coordinated holistic co-speech motions for 3D avatars. The framework addresses two key challenges: variability, allowing avatars to exhibit a wide range of motions for similar speech, and coordination, ensuring harmonious alignment among facial expressions, hand gestures, and body poses. ProbTalk builds on the variational autoencoder (VAE) architecture and incorporates three core designs: product quantization (PQ) to enrich the representation of complex holistic motion, a novel non-autoregressive model with 2D positional encoding to preserve structural information, and a secondary stage that refines motion details. ProbTalk also supports multi-modal conditioning, incorporating audio, motion context, and speaker identity. Evaluated on the SHOW dataset, ProbTalk outperforms state-of-the-art methods in both qualitative and quantitative comparisons, with notable gains in realism alongside strong diversity and efficient inference, making it a promising approach for generating realistic co-speech motions.
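To make the product-quantization idea concrete, here is a minimal sketch (not the authors' implementation): a motion latent is split into sub-vectors, and each sub-vector is snapped to its nearest codeword in its own small codebook, so the combined code space grows multiplicatively while each codebook stays small. All dimensions, names, and the random codebooks below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical sizes, not taken from the paper.
LATENT_DIM = 256      # dimension of a motion latent vector
NUM_GROUPS = 4        # number of PQ sub-codebooks
CODEBOOK_SIZE = 512   # entries per sub-codebook
SUB_DIM = LATENT_DIM // NUM_GROUPS

rng = np.random.default_rng(0)
# One codebook per sub-vector group (random placeholders standing in for learned codebooks).
codebooks = rng.normal(size=(NUM_GROUPS, CODEBOOK_SIZE, SUB_DIM))

def product_quantize(z):
    """Split a latent vector into NUM_GROUPS sub-vectors and quantize each
    against its own codebook; return the discrete codes and the quantized latent."""
    sub_vectors = z.reshape(NUM_GROUPS, SUB_DIM)
    codes, quantized = [], []
    for g in range(NUM_GROUPS):
        # Nearest codeword by Euclidean distance within group g.
        dists = np.linalg.norm(codebooks[g] - sub_vectors[g], axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        quantized.append(codebooks[g][idx])
    return np.array(codes), np.concatenate(quantized)

z = rng.normal(size=LATENT_DIM)   # stand-in for a latent from a VAE encoder
codes, z_q = product_quantize(z)
print(codes)       # NUM_GROUPS discrete indices representing the motion latent
print(z_q.shape)   # (256,) quantized latent
```

The appeal of PQ in this setting is capacity: with G groups and K codewords per group, the effective number of representable codes is K^G while only G*K codewords are stored, which helps cover the wide variation of holistic face, hand, and body motion with a compact model.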