29 May 2024 | Mingshuang Luo¹,²,³, Ruibing Hou¹, Hong Chang¹,³, Zimo Liu², Yaowei Wang², Shiguang Shan¹,³
M³GPT is a multimodal, multitask framework for motion comprehension and generation, handling tasks such as text-to-motion, motion-to-text, music-to-dance, dance-to-music, motion prediction, and motion in-between. The framework rests on three ideas: it integrates the motion-related modalities (text, music, motion/dance) into a unified representation space using discrete vector quantization; it models motion generation directly in the raw motion space, avoiding the information loss introduced by discrete tokenization; and it learns the connections and synergies among different motion tasks, using text as a bridge to align the diverse modalities. M³GPT is presented as the first model to comprehensively handle this range of motion-related signals, and it demonstrates strong zero-shot generalization.

Architecturally, the framework pairs multimodal tokenizers, which compress raw data into discrete tokens, with a motion-aware language model that generates motion tokens (hedged sketches of both pieces follow below). Training proceeds in three stages: multimodal tokenizer training, modality-alignment pre-training, and instruction tuning.

Evaluated on multiple datasets, M³GPT achieves competitive performance on both motion comprehension and generation, outperforms existing methods on motion-related tasks, and shows that learning task synergies and jointly optimizing the LLM with the motion de-tokenizer are effective. It also exhibits strong zero-shot generalization, for example long-term dance generation and dance generation conditioned on both music and text. A limitation is that the framework covers only human body movements, excluding hands and faces; future work could extend it to model these details.
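The summary describes the tokenizers only at a high level. As a rough illustration of how discrete vector quantization can turn continuous motion features into codebook indices (and back, via a de-tokenizer), here is a minimal PyTorch sketch; the class name, layer sizes, and dimensions (MotionVQTokenizer, latent_dim, codebook_size) are assumptions for illustration, not M³GPT's actual implementation.

```python
import torch
import torch.nn as nn

class MotionVQTokenizer(nn.Module):
    """Minimal sketch of a vector-quantization tokenizer for motion features.

    Hypothetical: layer sizes and names are illustrative, not M3GPT's design.
    """

    def __init__(self, input_dim: int = 263, latent_dim: int = 512, codebook_size: int = 1024):
        super().__init__()
        # Encoder compresses raw motion frames into continuous latents.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, latent_dim), nn.ReLU(), nn.Linear(latent_dim, latent_dim)
        )
        # Codebook of discrete embeddings; indices serve as "motion tokens".
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Decoder (de-tokenizer) maps quantized latents back to raw motion space.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.ReLU(), nn.Linear(latent_dim, input_dim)
        )

    def encode(self, motion: torch.Tensor) -> torch.Tensor:
        """Map motion frames (T, input_dim) to discrete token indices (T,)."""
        z = self.encoder(motion)                     # (T, latent_dim)
        dist = torch.cdist(z, self.codebook.weight)  # distance to every codebook entry
        return dist.argmin(dim=-1)                   # index of the nearest entry

    def decode(self, tokens: torch.Tensor) -> torch.Tensor:
        """Map token indices back to raw motion frames."""
        return self.decoder(self.codebook(tokens))


# Usage: 60 frames of a 263-dim motion representation -> 60 discrete tokens.
motion = torch.randn(60, 263)
tokenizer = MotionVQTokenizer()
tokens = tokenizer.encode(motion)          # LongTensor of shape (60,)
reconstruction = tokenizer.decode(tokens)  # (60, 263)
```

The decode path stands in for the motion de-tokenizer that the summary says is optimized jointly with the LLM, presumably how generation can be supervised in the raw motion space rather than only on discrete tokens.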
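The motion-aware language model is likewise only sketched in the summary. One common way to let a single language model consume text, motion, and music in one sequence is to offset each codebook's indices into an extended shared vocabulary; the sketch below illustrates that idea with hypothetical vocabulary sizes and helper names, and is not the paper's actual configuration.

```python
# Hypothetical sketch of sharing one vocabulary between text and discrete
# motion/music tokens. All sizes and helper names are assumptions.

TEXT_VOCAB_SIZE = 32000      # base LLM vocabulary (assumed)
MOTION_CODEBOOK_SIZE = 1024  # size of the motion tokenizer codebook (assumed)
MUSIC_CODEBOOK_SIZE = 1024   # size of the music tokenizer codebook (assumed)

def motion_token_id(code: int) -> int:
    """Offset a motion codebook index into the extended LLM vocabulary."""
    return TEXT_VOCAB_SIZE + code

def music_token_id(code: int) -> int:
    """Offset a music codebook index past the motion token range."""
    return TEXT_VOCAB_SIZE + MOTION_CODEBOOK_SIZE + code

def build_text_to_motion_sample(prompt_ids: list[int], motion_codes: list[int]) -> list[int]:
    """Concatenate a text instruction with target motion tokens into one sequence,
    so the language model can be trained with an ordinary next-token objective."""
    return prompt_ids + [motion_token_id(c) for c in motion_codes]

# Usage: a tokenized text prompt followed by three (made-up) motion codes.
sequence = build_text_to_motion_sample([101, 2057, 13], [7, 420, 999])
```

With all modalities expressed as token sequences of this kind, the three training stages reported in the summary (tokenizer training, modality-alignment pre-training, instruction tuning) can all drive the same language-model backbone.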