Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis

July 2024 | Zeyi Zhang, Tenglong Ao, Yuyao Zhang, Qingzhe Gao, Chuan Lin, Baoquan Chen, Libin Liu
This paper presents Semantic Gesticulator, a framework for synthesizing realistic co-speech gestures with strong semantic correspondence. Semantic gestures are crucial for effective non-verbal communication, yet they occur rarely and sparsely in natural human motion. To address this, the authors develop a generative retrieval framework built on a large language model (LLM) that efficiently retrieves suitable semantic gesture candidates from a motion library. Drawing on linguistic and behavioral studies, they compile a comprehensive list of commonly used semantic gestures and capture a high-quality motion dataset of body and hand movements covering that list. A GPT-based model generates gestures that match the rhythm of speech, and a semantics-aware alignment mechanism blends the retrieved semantic gestures into the model's output while preserving naturalness. User studies confirm the quality and human-likeness of the results and show that the system outperforms state-of-the-art systems in semantic appropriateness. The code and dataset are released for academic research.

The system is built on a discrete latent motion space learned with a residual VQ-VAE, which tokenizes gesture sequences into hierarchical, compact motion tokens while preserving motion quality and diversity. Three modules operate over this space: an end-to-end neural generator, an LLM-based generative retrieval framework, and a semantics-aware alignment mechanism. The generator predicts discrete gesture tokens conditioned on speech, and the tokens are decoded into gesture motion. The retrieval framework interprets the context of the transcript and selects suitable semantic gestures from the motion library. The alignment mechanism fuses the semantic and rhythmic gestures at the latent level, so the generated motion is both meaningful and rhythmically coherent. The hedged sketches below illustrate these components in turn.

In summary, the contributions are: a semantics-aware co-speech gesture synthesis system that produces natural and semantically rich gestures; a GPT-based generator and a semantics-aware alignment mechanism that together ensure motion quality and generalization across different audio inputs; an LLM-based generative retrieval framework that efficiently retrieves semantic gestures from a gesture library; and a comprehensive list of commonly used semantic gestures with a matching high-quality dataset, both released to the community for academic research.
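To make the residual VQ-VAE idea concrete, here is a minimal sketch of residual quantization itself, not the authors' implementation: each codebook level encodes the residual left by the previous level, so every latent frame receives a coarse-to-fine stack of tokens. The feature dimension, number of levels, and codebook sizes below are invented for illustration.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize latent frames z (T, D) with a stack of codebooks.

    Each codebook (K, D) encodes the residual left by the previous
    level, so every frame gets one token per level (coarse to fine).
    """
    residual = z.copy()
    tokens, quantized = [], np.zeros_like(z)
    for cb in codebooks:
        # Nearest codeword for every frame's current residual.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        tokens.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]
    return np.stack(tokens, axis=1), quantized  # (T, L) token grid

# Toy usage: 8 latent frames, 16-d features, 3 levels of 32 codes each.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
codebooks = [rng.normal(size=(32, 16)) for _ in range(3)]
tokens, z_hat = residual_quantize(z, codebooks)
print(tokens.shape)  # (8, 3): one coarse-to-fine token stack per frame
```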
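The generator is described as GPT-based: it autoregressively predicts the next motion token from speech features and previously generated tokens. The skeleton below shows a generic sampling loop under that reading; the `model` interface, the BOS token, and the single-codebook simplification (the real model also emits residual-level tokens) are assumptions, not the paper's architecture.

```python
import torch

@torch.no_grad()
def generate_tokens(model, audio_feats, bos_id, n_frames):
    """Autoregressively sample one motion token per frame.

    `model(audio_feats, tokens)` is a placeholder assumed to return
    next-token logits of shape (1, vocab_size).
    """
    tokens = torch.tensor([[bos_id]])
    for _ in range(n_frames):
        logits = model(audio_feats, tokens)
        probs = torch.softmax(logits[0], dim=-1)
        nxt = torch.multinomial(probs, 1)      # sample for gesture diversity
        tokens = torch.cat([tokens, nxt[None]], dim=1)
    return tokens[:, 1:]  # drop BOS; decode with the RVQ-VAE decoder
```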
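The generative retrieval module asks an LLM to read the transcript and name suitable gestures from the library. The sketch below is only a guess at the interaction pattern: the prompt wording, the `complete` callable, and the gesture entries are all placeholders, not the authors' prompt or library.

```python
import json

GESTURE_LIBRARY = {  # toy excerpt; the real library is far larger
    "G01": "raise both palms upward (openness)",
    "G02": "shrug with palms out (uncertainty)",
    "G03": "count on fingers (enumeration)",
}

def build_retrieval_prompt(transcript: str) -> str:
    entries = "\n".join(f"{gid}: {desc}" for gid, desc in GESTURE_LIBRARY.items())
    return (
        "You match spoken text to semantic gestures.\n"
        f"Gesture library:\n{entries}\n\n"
        f'Transcript: "{transcript}"\n'
        "Return JSON: a list of {gesture_id, word} pairs for words that\n"
        "warrant a semantic gesture, or an empty list."
    )

def retrieve_gestures(transcript: str, complete) -> list[dict]:
    """`complete` is any text-in/text-out LLM call (placeholder)."""
    return json.loads(complete(build_retrieval_prompt(transcript)))

# e.g. retrieve_gestures("I really don't know...", complete=my_llm)
#      -> [{"gesture_id": "G02", "word": "know"}]
```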
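Finally, the alignment mechanism injects a retrieved gesture into the generator's output in latent space rather than by splicing poses. One plausible minimal reading is a cross-fade of latent frames around the target word's timing; the window length and linear ramp here are invented, not the paper's mechanism.

```python
import numpy as np

def fuse_latents(rhythmic, semantic, start, blend=4):
    """Overwrite a span of rhythmic latents (T_r, D) with a semantic
    gesture's latents (T_s, D), cross-fading `blend` frames on each
    side so the transition stays smooth before decoding to motion."""
    out = rhythmic.copy()
    T = semantic.shape[0]
    out[start:start + T] = semantic
    for i in range(blend):
        w = (i + 1) / (blend + 1)  # ramp 0 -> 1 into the gesture
        if start - blend + i >= 0:  # lead-in frames
            out[start - blend + i] = (1 - w) * rhythmic[start - blend + i] \
                                     + w * semantic[0]
        j = start + T + blend - 1 - i  # lead-out frames
        if j < len(out):
            out[j] = (1 - w) * rhythmic[j] + w * semantic[-1]
    return out
```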
The system is evaluated on two high-quality speech-gesture datasets, ZEGGS and BEAT, where it outperforms state-of-the-art systems in semantic accuracy. Its performance is validated through user studies and quantitative evaluations using metrics such as Fréchet Gesture Distance (FGD) and Semantic Score (SC), which confirm its effectiveness at semantic gesture synthesis, and an ablation study validates the design choices. In addition, a data augmentation framework is proposed to enrich the diversity of the SeG dataset.
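FGD compares real and generated gestures the way FID compares images: fit a Gaussian to feature embeddings of each set and compute the Fréchet distance between the two Gaussians. A generic sketch of that formula follows; the feature extractor (typically a pretrained motion autoencoder) is assumed and not shown.

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fit to two feature sets (N, D)."""
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginaries
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2 * covmean)
```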