**Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis**
This paper introduces *Semantic Gesticulator*, a novel framework designed to synthesize realistic and semantically rich gestures that accompany speech. The system addresses the challenge of generating meaningful gestures, which are often sparse and difficult to capture in natural human motion datasets. To achieve this, the authors develop a generative retrieval framework based on a large language model (LLM) that efficiently retrieves suitable semantic gesture candidates from a comprehensive motion library in response to the input speech. The motion library is constructed by summarizing commonly used semantic gestures drawn from linguistic and behavioral studies and by collecting a high-quality motion dataset covering both body and hand movements.
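To make the retrieval idea concrete, here is a minimal sketch of prompting an LLM to match transcript spans to entries in a gesture library. The miniature library, the prompt wording, the JSON output schema, and the `llm` callable are all hypothetical stand-ins for illustration; they are not the authors' actual taxonomy, prompt, or interface.

```python
import json
from typing import Callable, Dict, List

# Hypothetical miniature gesture library; the paper's library is far larger
# and is grounded in linguistic and behavioral studies.
GESTURE_LIBRARY: Dict[str, str] = {
    "G01": "raise both hands, palms up (offering / uncertainty)",
    "G02": "point forward with index finger (deixis: 'you', 'there')",
    "G03": "count on fingers (enumeration: 'first', 'second')",
}

def retrieve_semantic_gestures(transcript: str,
                               llm: Callable[[str], str]) -> List[dict]:
    """Ask an LLM to map transcript words to gesture IDs from the library.

    `llm` is any text-in/text-out completion function; the prompt and the
    returned JSON schema here are illustrative, not the paper's.
    """
    catalog = "\n".join(f"{gid}: {desc}" for gid, desc in GESTURE_LIBRARY.items())
    prompt = (
        "You are selecting co-speech gestures.\n"
        f"Gesture library:\n{catalog}\n\n"
        f"Transcript: \"{transcript}\"\n"
        "Return a JSON list of objects with fields 'word' and 'gesture_id', "
        "using only IDs from the library. Return [] if nothing fits."
    )
    return json.loads(llm(prompt))
```

Treating the LLM as a retriever over a fixed, curated library (rather than asking it to describe motion directly) is what keeps the retrieved gestures executable as high-quality motion clips.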
The system consists of three main components (a minimal composition sketch follows the list):
1. **End-to-End Neural Generator**: A GPT-based model that predicts discrete gesture tokens conditioned on synchronized speech audio features.
2. **Generative Retrieval Framework**: An LLM-based model that interprets transcript context and selects appropriate semantic gestures from the motion library.
3. **Semantics-Aware Alignment Mechanism**: A module that integrates the retrieved semantic gestures with the rhythmic motion generated by the generator, ensuring both semantic richness and rhythmic coherence.
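The sketch below shows one way the three components might compose at inference time, assuming a tokenized gesture representation. The interfaces, names (`GestureClip`, `synthesize`), and callable signatures are hypothetical and chosen for readability; they are not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class GestureClip:
    tokens: List[int]   # discrete gesture tokens (e.g., indices into a codebook)
    start_frame: int    # anchor position on the output timeline

def synthesize(
    audio_features: Sequence[float],
    transcript: str,
    generate_rhythmic: Callable[[Sequence[float]], List[int]],   # component 1
    retrieve_semantic: Callable[[str], List[GestureClip]],       # component 2
    align: Callable[[List[int], List[GestureClip]], List[int]],  # component 3
) -> List[int]:
    """Illustrative composition of the three components: the generator supplies
    rhythm-coherent gesture tokens, the retriever proposes semantic clips from
    the library, and the alignment step merges the two into one token stream."""
    base_tokens = generate_rhythmic(audio_features)
    semantic_clips = retrieve_semantic(transcript)
    return align(base_tokens, semantic_clips)
```

Passing the stages in as callables keeps the sketch agnostic to how each one is actually implemented; the key point it illustrates is that semantic clips are injected into, rather than generated alongside, the rhythm-driven token stream.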
The authors evaluate the system using two high-quality speech-gesture datasets (ZEGGS and BEAT) and conduct user studies to assess the quality and human-likeness of the generated gestures. The results show that the system outperforms state-of-the-art methods in semantic appropriateness and in robustly generating rhythmically coherent, semantically explicit gestures. The code and dataset will be released for academic research.