26 Mar 2024 | Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, Christian Theobalt
**ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis**
**Authors:** Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, Christian Theobalt
**Institutions:** Max Planck Institute for Informatics, Saarland University, Vrije Universiteit Amsterdam
**Abstract:**
Gestures play a crucial role in human communication. While recent methods for co-speech gesture generation can produce beat-aligned motions, they struggle to generate semantically aligned gestures. CONVOFUSION is a diffusion-based approach that can generate gestures based on multi-modal speech inputs and facilitate controllability in gesture synthesis. The method proposes two guidance objectives to allow users to modulate the impact of different conditioning modalities (e.g., audio vs text) and emphasize specific words during gesturing. CONVOFUSION supports monadic and dyadic gesture synthesis and introduces the DND GROUP GESTURE dataset, which contains 6 hours of gesture data from 5 participants playing Dungeons and Dragons. The dataset includes high-quality full-body motion capture, multi-channel audio recordings, and text transcriptions. The method's effectiveness is demonstrated through comparisons with several recent works on various tasks.
**Key Contributions:**
- CONVOFUSION: A diffusion-based approach for monadic and dyadic gesture synthesis, capable of generating both co-speech and reactive/active gestures.
- Two guidance objectives: Modality guidance and word-excitation guidance, enabling coarse and fine-grained control over the generated gestures.
- Time-aware latent representation: Encodes motion into chunked latents, allowing for perpetual gesture synthesis and temporal consistency.
- DND GROUP GESTURE dataset: A high-quality dataset involving 5 participants in multiple sessions of Dungeons and Dragons, facilitating research on dyadic and group gesture synthesis.
**Methods:**
- **Scale-aware Temporal Latent Representation:** Decouples finger motions from body motions and encodes them into separate, chunk-wise latent spaces (see the latent-encoding sketch after this list).
- **Modality-Conditional Gesture Generation:** Uses a transformer decoder to approximate the denoising function, integrating multiple modalities (audio, text, speaker identity); see the denoiser sketch after this list.
- **Controllable Gesture Generation:** Features modality guidance and word-excitation guidance to control the impact of specific modalities and to emphasize gestures for selected words (a guidance sketch follows this list).
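To make the chunked, scale-aware latent representation concrete, below is a minimal PyTorch sketch of how body and finger motion could be encoded into separate, per-chunk latents. The module name, chunk length, feature dimensions, and GRU encoders are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (assumed design): chunk a motion sequence and encode body and
# finger motion into decoupled, per-chunk latent spaces.
import torch
import torch.nn as nn

class ChunkedMotionEncoder(nn.Module):
    """Encodes a motion sequence chunk-by-chunk into separate body and finger
    latents, so long sequences can be synthesized chunk-wise."""
    def __init__(self, body_dim=69, finger_dim=90, latent_dim=256, chunk_len=40):
        super().__init__()
        self.chunk_len = chunk_len
        # Independent encoders keep coarse body motion and fine finger motion
        # in separate latent spaces.
        self.body_enc = nn.GRU(body_dim, latent_dim, batch_first=True)
        self.finger_enc = nn.GRU(finger_dim, latent_dim, batch_first=True)

    def forward(self, body, fingers):
        # body: (B, T, body_dim), fingers: (B, T, finger_dim); T divisible by chunk_len
        B, T, _ = body.shape
        n = T // self.chunk_len
        body = body.reshape(B * n, self.chunk_len, -1)
        fingers = fingers.reshape(B * n, self.chunk_len, -1)
        _, zb = self.body_enc(body)      # final hidden state: (1, B*n, latent_dim)
        _, zf = self.finger_enc(fingers)
        # One latent per chunk and per scale -> (B, n_chunks, 2, latent_dim)
        return torch.stack([zb.squeeze(0), zf.squeeze(0)], dim=1).reshape(B, n, 2, -1)
```

Encoding fixed-length chunks rather than whole sequences is what allows new chunks to be generated and appended over time, i.e., the perpetual synthesis mentioned in the contributions.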
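The modality-conditional denoiser can likewise be sketched as a transformer decoder whose queries are the noisy chunk latents and whose memory is the concatenation of audio, text, and speaker tokens. The dimensions, timestep embedding, and the assumption that condition tokens are already projected to the model width are illustrative, not the authors' exact configuration.

```python
# Minimal sketch (assumed design): a transformer decoder cross-attends from
# noisy chunk latents to multi-modal condition tokens (audio, text, speaker).
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, latent_dim=256, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, d_model)
        self.time_emb = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, z_t, t, audio_tok, text_tok, speaker_tok):
        # z_t: (B, N, latent_dim) noisy chunk latents; t: (B,) diffusion step
        x = self.in_proj(z_t) + self.time_emb(t.float().unsqueeze(-1)).unsqueeze(1)
        # Condition tokens (each (B, *, d_model)) form the memory for cross-attention.
        cond = torch.cat([audio_tok, text_tok, speaker_tok], dim=1)
        return self.out_proj(self.decoder(tgt=x, memory=cond))  # denoised latents
```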
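Modality guidance can be sketched at sampling time in the spirit of classifier-free guidance: the denoiser (reusing the interface from the previous sketch) is queried with individual modalities nulled out, and per-modality weights control how strongly audio or text steers the prediction. The guidance formula and null-token handling below are assumptions; word-excitation guidance, which emphasizes selected words, is not shown.

```python
# Minimal sketch (assumed formula): per-modality guidance weights modulate how
# much each conditioning signal influences the final prediction.
import torch

def modality_guided_prediction(denoiser, z_t, t, audio_tok, text_tok, speaker_tok,
                               null_audio, null_text, w_audio=2.0, w_text=4.0):
    # Unconditional prediction: both modalities replaced by learned null tokens.
    eps_uncond = denoiser(z_t, t, null_audio, null_text, speaker_tok)
    # Predictions with one modality active at a time.
    eps_audio = denoiser(z_t, t, audio_tok, null_text, speaker_tok)
    eps_text = denoiser(z_t, t, null_audio, text_tok, speaker_tok)
    # Each weight scales the contribution of its modality.
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))
```

Raising `w_text` relative to `w_audio` would emphasize the semantic content of the utterance over beat-driven motion, which is the kind of coarse, user-facing control the modality guidance objective targets.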
**Evaluation:**
- **Monadic Co-speech Gesture Synthesis:** Outperforms baselines in beat alignment, diversity, and FID scores.
- **Dyadic Co-speech Gesture Synthesis:** Achieves similar beat alignment to ground-truth while producing more diverse and animated motions.
- **User Study:** Demonstrates better semantic alignment and user preference for CONVOFUSION-generated gestures.
**Conclusion:**
CONVOFUSION addresses the challenge of generating semantically coherent gestures in conversational settings by leveraging a time-aware latent representation and advanced control mechanisms. The accompanying DND GROUP GESTURE dataset further opens up research on dyadic and group gesture synthesis.