20 Mar 2024 | Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu
CoMo is a controllable motion generation model that generates and edits human motion sequences through language-guided pose code editing. It decomposes motion into discrete, semantically meaningful pose codes, each capturing the state of a specific body part. Given a textual description, CoMo autoregressively generates a sequence of pose codes, which is then decoded into a 3D motion. Because the pose codes are interpretable, large language models (LLMs) can modify them directly in response to editing instructions, enabling fine-grained control over both generation and editing.

The key contributions are: (1) a semantic motion representation that factorizes motion sequences into interpretable pose codes; (2) a transformer-based model that autoregressively generates these low-level pose codes from high-level text descriptions; and (3) the use of semantic pose codes as an intuitive interface through which LLMs perform motion editing.

Evaluated on the HumanML3D and KIT datasets, CoMo achieves motion generation performance competitive with state-of-the-art models and, as demonstrated by human studies, significantly outperforms prior work in motion editing. Its ability to both generate and edit motion from language inputs makes it a promising approach to text-driven, controllable motion generation.
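To make the generation pipeline concrete, here is a minimal PyTorch sketch of an autoregressive pose-code generator: at each frame it emits one discrete code per body part, conditioned on a text embedding. All hyperparameters (`NUM_PARTS`, `CODEBOOK`, `D_MODEL`), the shared embedding trick, and the per-part output heads are illustrative assumptions, not CoMo's actual architecture or configuration.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters -- assumptions, not CoMo's real settings.
NUM_PARTS = 6    # pose codes per frame, one per body part
CODEBOOK = 512   # size of each part's discrete codebook
D_MODEL = 512    # model / text-embedding width

class PoseCodeGenerator(nn.Module):
    """Toy autoregressive generator: per frame, predict one discrete
    pose code for each body part, conditioned on a text embedding."""

    def __init__(self):
        super().__init__()
        # One shared token table; each part is offset into its own range.
        self.embed = nn.Embedding(NUM_PARTS * CODEBOOK, D_MODEL)
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.heads = nn.ModuleList(nn.Linear(D_MODEL, CODEBOOK)
                                   for _ in range(NUM_PARTS))

    @torch.no_grad()
    def generate(self, text_emb: torch.Tensor, num_frames: int) -> torch.Tensor:
        """text_emb: (1, T_text, D_MODEL) text encoding.
        Returns (num_frames, NUM_PARTS) integer pose codes."""
        codes = torch.zeros(1, 1, NUM_PARTS, dtype=torch.long)  # BOS frame
        offsets = torch.arange(NUM_PARTS) * CODEBOOK
        for _ in range(num_frames):
            # Flatten (frame, part) codes into one token sequence.
            tokens = (codes + offsets).reshape(1, -1)
            h = self.decoder(self.embed(tokens), memory=text_emb)
            last = h[:, -1]  # hidden state after the last token
            next_codes = torch.stack(
                [head(last).argmax(-1) for head in self.heads], dim=-1)
            codes = torch.cat([codes, next_codes.unsqueeze(1)], dim=1)
        return codes[0, 1:]  # drop the BOS frame

gen = PoseCodeGenerator()
text = torch.randn(1, 4, D_MODEL)              # stand-in text encoding
pose_codes = gen.generate(text, num_frames=8)  # -> shape (8, 6)
print(pose_codes.shape)
```

In a full system, the resulting code sequence would be passed to a learned decoder that maps discrete pose codes back to 3D joint positions; that decoder is omitted here.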
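The editing interface follows from the codes being human-readable: an LLM can rewrite pose-code labels per frame given an instruction. The sketch below is a hypothetical version of that loop; the prompt wording, the JSON frame format, the body-part labels, and the `call_llm` helper are all assumptions made for illustration, not CoMo's actual prompts or tooling.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-LLM call (e.g. GPT-4)."""
    raise NotImplementedError("plug in your LLM client here")

# Assumed prompt template: frames are JSON objects mapping body parts
# to human-readable pose-code labels.
EDIT_PROMPT = """You are editing a motion described by per-frame pose codes.
Each frame is a JSON object mapping body parts to pose-code labels, e.g.
{{"left_arm": "raised", "right_arm": "lowered"}}.
Apply this instruction to every relevant frame and return the edited
frames as a JSON list with the same structure: {instruction}

Frames:
{frames}
"""

def edit_pose_codes(frames: list[dict], instruction: str) -> list[dict]:
    """Ask the LLM to rewrite the readable pose-code labels, then parse
    its JSON answer back into per-frame code dictionaries."""
    prompt = EDIT_PROMPT.format(instruction=instruction,
                                frames=json.dumps(frames, indent=2))
    return json.loads(call_llm(prompt))

# Usage (hypothetical label mapping from discrete codes):
# frames = decode_to_labels(pose_codes)
# edited = edit_pose_codes(frames, "raise the left arm above the head")
```

The edited labels would then be mapped back to codebook indices and decoded into motion, so the LLM never has to reason about raw joint coordinates.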