ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis

26 Mar 2024 | Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, Christian Theobalt
This paper presents CONVOFUSION, a diffusion-based approach to multi-modal gesture synthesis that can generate both co-speech gestures and reactive/passive gestures. The method makes gesture synthesis controllable: users can modulate how strongly each conditioning modality (e.g., audio vs. text) influences the output and can emphasize specific words while gesturing. It is versatile and can be trained for either monologue or conversational gestures. To further advance multi-party interactive gesture research, the paper also introduces the DND GROUP GESTURE dataset, which contains 6 hours of gesture data showing five people interacting with one another, including high-quality full-body motion capture, multi-channel audio recordings, and text transcriptions.

The method builds on a latent denoising diffusion probabilistic model (DDPM) framework: the diffusion model is trained to denoise a latent representation of the gesture motions, and the generated motion latents are then decoded by a motion decoder. Unlike existing motion latent diffusion methods, the proposed time-aware latent representation enables perpetual gesture synthesis while maintaining high synthesis quality.

The design supports both coarse- and fine-grained control. For coarse control, the modality-level guidance strategy lets the user adjust how much a specific modality influences the generated motion. For fine control, the user can select specific words whose gestures should be emphasized via the proposed word-excitation guidance (WEG) objective. Illustrative sketches of both control mechanisms follow.
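The summary does not spell out the guidance formula; as a rough illustration of the coarse, modality-level control, here is a minimal sketch that assumes a compositional, classifier-free-guidance-style combination in which each conditioning modality gets its own weight. The `model` callable, its keyword arguments, and the combination rule are hypothetical placeholders for illustration, not the authors' implementation.

```python
def modality_guided_noise(model, z_t, t, audio_cond, text_cond,
                          w_audio=2.5, w_text=2.5):
    """Sketch of modality-level guidance for a latent gesture DDPM.

    Each conditioning modality (audio, text) contributes its own guidance
    term, so the user can scale how strongly it shapes the denoised latent.
    This mirrors compositional classifier-free guidance; the exact rule used
    by ConvoFusion may differ.
    """
    # Unconditional prediction: all conditions dropped (null tokens).
    eps_uncond = model(z_t, t, audio=None, text=None)
    # Single-modality predictions, with the other modality dropped.
    eps_audio = model(z_t, t, audio=audio_cond, text=None)
    eps_text = model(z_t, t, audio=None, text=text_cond)
    # Combine: scale each modality's deviation from the unconditional output.
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))
```

Under such a scheme, raising `w_text` relative to `w_audio` would bias the generated latents toward the transcript, while setting a weight to zero effectively silences that modality.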
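For the fine-grained control, the sketch below assumes that word-excitation guidance steers each denoising step with the gradient of an attention-based objective over the user-selected word tokens, in the spirit of attention-excitation guidance; the actual WEG objective and model interface may differ, and `return_attention`, `target_idx`, and `step_size` are illustrative assumptions.

```python
import torch

def word_excitation_step(model, z_t, t, text_tokens, target_idx, step_size=0.1):
    """Sketch of a word-excitation guidance (WEG) update for one denoising step.

    Assumes `model` can expose cross-attention between latent motion tokens and
    word tokens (shape: [num_latent_tokens, num_words]); this interface and the
    objective are placeholders, not ConvoFusion's actual API.
    """
    z_t = z_t.detach().requires_grad_(True)
    _, attn = model(z_t, t, text=text_tokens, return_attention=True)
    # Attention mass that the motion latents place on the user-selected words.
    word_attn = attn[:, target_idx]
    # Penalize selected words whose peak attention is low.
    loss = (1.0 - word_attn.max(dim=0).values).sum()
    grad = torch.autograd.grad(loss, z_t)[0]
    # Nudge the noisy latent so the chosen words receive stronger gestures.
    return (z_t - step_size * grad).detach()
```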
CONVOFUSION is evaluated on the BEAT dataset for monadic co-speech gesture synthesis and on the DND GROUP GESTURE dataset for dyadic co-speech gesture synthesis. On BEAT, it achieves superior beat-alignment and diversity scores compared to other methods, and a user study shows that WEG improves the semantic alignment between speech and the generated motions. On DND GROUP GESTURE, the method attains beat alignment similar to the ground truth while producing higher L1 diversity, indicating non-static motions, and the corresponding user study again finds better semantic alignment when WEG is used. Ablation analysis shows that the chunked, scale-aware latent representation is effective both for perpetual motion synthesis and for better temporal alignment with the conditioning modalities. Visualizations of the attention maps highlight the spatio-temporal properties learned during training, and the chunked latent representation enables perpetual rollout of long gesture sequences, sketched below.
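As a final illustration, the following sketch shows how a chunked latent representation could support perpetual rollout: each latent chunk is denoised conditioned on the speech/text features of its window and on the previously generated chunk, then decoded to motion. `sample_chunk`, `decode`, and the seeding convention are hypothetical helpers, not the paper's implementation.

```python
def perpetual_rollout(sample_chunk, decode, conditions, num_chunks, seed_latent=None):
    """Sketch of chunk-by-chunk gesture synthesis with a chunked latent space.

    `sample_chunk(cond, prev_latent)` is assumed to run the reverse diffusion
    process for one latent chunk, conditioned on that window's speech/text
    features and on the previously generated chunk; `decode` maps latents back
    to poses. Both are placeholders for illustration.
    """
    motion = []
    prev = seed_latent  # optionally start from an encoded seed motion
    for cond in conditions[:num_chunks]:
        z = sample_chunk(cond, prev)   # denoise one time-aware latent chunk
        motion.append(decode(z))       # motion decoder maps latent -> poses
        prev = z                       # next chunk is conditioned on this one
    return motion
```

Because each chunk only depends on its own conditioning window and the previous chunk, the loop can in principle continue indefinitely, which is what the summary refers to as perpetual synthesis.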