Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model


2 Apr 2024 | Xu He, Qiaochu Huang, Zhensong Zhang, Zhiwei Lin, Zhiyong Wu, Sicheng Yang, Minglei Li, Zhiyi Chen, Songcen Xu, Xiaofei Wu
This paper presents a novel motion-decoupled framework that generates co-speech gesture videos directly from audio input, without relying on structural human priors. The framework addresses two main challenges: 1) capturing complex human motion while retaining essential appearance information, and 2) aligning gestures with speech over arbitrary lengths. To this end, the authors apply a nonlinear TPS (thin-plate spline) transformation to extract latent motion features and train a transformer-based diffusion model to learn the temporal correlation between speech and gestures. An optimal motion selection module then stitches sampled segments into long-term coherent gesture videos, and a refinement network enhances visual quality by recovering missing details.

Evaluated on the PATS dataset, the framework outperforms existing methods on both motion-related metrics (FGD, Diversity) and video-related metrics (FVD), and it excels at generating precise and diverse fine-grained hand movements, which are crucial for high-quality human gestures. A user study further confirms that the generated videos are perceived as natural and well matched to speech. The code, demos, and additional resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
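The pipeline described above can be summarized structurally. Below is a minimal, illustrative PyTorch sketch, not the authors' implementation: the module names, tensor dimensions, and the distance-based segment-selection heuristic are assumptions chosen for clarity, and the actual TPS motion encoder, diffusion noise schedule, and refinement network are omitted.

```python
# Minimal structural sketch of a speech-conditioned latent-motion diffusion step.
# All shapes and module choices are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class LatentMotionDiffusion(nn.Module):
    """Transformer denoiser over latent motion features, conditioned on speech features."""

    def __init__(self, motion_dim=128, audio_dim=64, d_model=256, n_layers=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.time_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_feat, t):
        # noisy_motion: (B, T, motion_dim), audio_feat: (B, T, audio_dim), t: (B, 1)
        h = self.motion_proj(noisy_motion) + self.audio_proj(audio_feat)
        h = h + self.time_embed(t.float()).unsqueeze(1)  # broadcast diffusion step over time
        return self.out(self.backbone(h))  # predicted clean (or noise) motion latents


def select_best_continuation(candidates, prev_last_frame):
    """Toy stand-in for the optimal motion selection module: among several sampled
    segments, pick the one whose first latent frame best continues the previous segment."""
    dists = torch.stack(
        [(c[:, 0] - prev_last_frame).norm(dim=-1).mean() for c in candidates]
    )
    return candidates[int(dists.argmin())]


if __name__ == "__main__":
    model = LatentMotionDiffusion()
    B, T = 2, 32
    audio = torch.randn(B, T, 64)           # speech features (e.g. spectral), shape assumed
    noisy = torch.randn(B, T, 128)          # noised latent motion features (TPS-style)
    step = torch.randint(0, 1000, (B, 1))   # diffusion timestep
    denoised = model(noisy, audio, step)
    print(denoised.shape)                   # torch.Size([2, 32, 128])
```

In this sketch the denoiser would be wrapped in a standard diffusion sampling loop, and `select_best_continuation` would be applied between consecutively generated segments to keep long videos temporally coherent, mirroring the role the paper assigns to its optimal motion selection module.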