29 May 2024 | Mingshuang Luo, Ruibing Hou, Hong Chang, Zimo Liu, Yaowei Wang, Shiguang Shan
This paper introduces M³GPT, an advanced Multimodal, Multitask framework for motion comprehension and generation. M³GPT operates on three fundamental principles: creating a unified representation space for various motion-relevant modalities, modeling generation directly in the raw motion space, and learning connections and synergies among different motion tasks. The framework employs discrete vector quantization for multimodal control and generation signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. It also involves jointly training the LLM and motion de-tokenizer to optimize in both discrete semantic and raw continuous motion spaces, enhancing the LLM's ability to generate detailed motion. Additionally, M³GPT uses text as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. Extensive experiments demonstrate M³GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for challenging tasks.
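To make the "single vocabulary" idea concrete, the following is a minimal sketch of how vector-quantized motion and music features could be mapped into one shared token-id space alongside text tokens. It is an illustration under stated assumptions, not the authors' implementation: the codebooks, the text vocabulary size, the offset scheme, and the function names `quantize`, `to_shared_ids`, and `dequantize` are all hypothetical, and a real tokenizer would learn its codebook rather than hard-code it.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance), which is the lookup step of vector quantization."""
    indices = []
    for feature in features:
        dists = [sum((a - b) ** 2 for a, b in zip(feature, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def to_shared_ids(indices: Sequence[int], offset: int) -> List[int]:
    """Shift modality-local code indices into the shared vocabulary."""
    return [offset + i for i in indices]


def dequantize(shared_ids: Sequence[int], offset: int,
               codebook: Sequence[Vector]) -> List[List[float]]:
    """Map shared token ids back to codebook vectors (a stand-in for the
    first step of a de-tokenizer)."""
    return [list(codebook[i - offset]) for i in shared_ids]


# Hypothetical sizes and codebooks, chosen only for illustration.
TEXT_VOCAB_SIZE = 100                       # assumed: text tokens occupy ids 0..99
MOTION_CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
MUSIC_CODEBOOK = [[0.5, 0.5], [2.0, 2.0]]

# Assumed layout: [text ids | motion ids | music ids]
MOTION_OFFSET = TEXT_VOCAB_SIZE
MUSIC_OFFSET = MOTION_OFFSET + len(MOTION_CODEBOOK)

motion_features = [[0.9, 0.1], [0.1, 0.8], [0.0, 0.1]]
music_features = [[1.8, 2.1]]

motion_ids = to_shared_ids(quantize(motion_features, MOTION_CODEBOOK), MOTION_OFFSET)
music_ids = to_shared_ids(quantize(music_features, MUSIC_CODEBOOK), MUSIC_OFFSET)
recovered = dequantize(motion_ids, MOTION_OFFSET, MOTION_CODEBOOK)

print(motion_ids)  # [101, 102, 100]
print(music_ids)   # [104]
print(recovered)   # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

Because each modality occupies a disjoint id range, a language model can consume and emit all three token types through one embedding table and one output layer, which is the property the abstract attributes to the unified representation space.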