20 Mar 2024 | Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu
CoMo is a controllable motion generation model that leverages large language models (LLMs) to generate and edit human motions. It decomposes motions into discrete, semantically meaningful "pose codes," each encapsulating the state of a specific body part. CoMo autoregressively generates pose code sequences from textual inputs, which are then decoded into 3D motions. Because the representation is interpretable, LLMs can intervene directly in motion editing by adjusting pose codes according to editing instructions. Experiments show that CoMo achieves competitive motion generation performance compared to state-of-the-art models and, in human studies, significantly outperforms prior work in motion editing. CoMo's key contributions include a semantic motion representation, a transformer-based motion generator, and an intuitive LLM-driven motion editing interface.
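To make the pose-code idea concrete, here is a minimal, illustrative sketch of the pipeline described above: per-body-part discrete codes, an autoregressive generator that maps text to a pose-code sequence, and an LLM-style edit realized as a simple code swap. All names here (`BODY_PARTS`, `CODEBOOK`, `PoseFrame`, `generate_pose_codes`, `edit_pose_codes`) and the toy codebooks are hypothetical placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of CoMo's interpretable pose-code representation.
# The real model learns codebooks and generates codes with a transformer;
# here both are replaced by hand-written stand-ins for illustration.

from dataclasses import dataclass

# Assumed body parts, each with its own small discrete codebook.
BODY_PARTS = ["left_arm", "right_arm", "left_leg", "right_leg", "torso"]
CODEBOOK = {
    "left_arm": ["lowered", "raised", "bent"],
    "right_arm": ["lowered", "raised", "bent"],
    "left_leg": ["standing", "stepping", "kicking"],
    "right_leg": ["standing", "stepping", "kicking"],
    "torso": ["upright", "leaning", "twisted"],
}

@dataclass
class PoseFrame:
    """One frame of motion as interpretable per-part pose codes."""
    codes: dict  # body part -> semantic code, e.g. {"left_arm": "raised"}

def generate_pose_codes(text: str, num_frames: int) -> list[PoseFrame]:
    """Stand-in for the autoregressive generator: text -> pose-code sequence.
    Emits a fixed alternating-step pattern purely for illustration."""
    frames = []
    for t in range(num_frames):
        left = "stepping" if t % 2 == 0 else "standing"
        frames.append(PoseFrame(codes={
            "left_arm": "lowered",
            "right_arm": "lowered",
            "left_leg": left,
            "right_leg": "standing" if left == "stepping" else "stepping",
            "torso": "upright",
        }))
    return frames

def edit_pose_codes(frames: list[PoseFrame], instruction: str) -> list[PoseFrame]:
    """Stand-in for LLM-driven editing: because each code is semantic,
    an instruction like 'raise the left arm' reduces to a code swap."""
    if "raise the left arm" in instruction:
        for frame in frames:
            frame.codes["left_arm"] = "raised"
    return frames

frames = generate_pose_codes("a person walks forward", num_frames=4)
frames = edit_pose_codes(frames, "raise the left arm")
for i, frame in enumerate(frames):
    print(i, frame.codes)
```

The point of the sketch is the editing step: because each code names a body-part state rather than an opaque latent, an LLM can map a natural-language instruction onto specific code replacements before the sequence is decoded into 3D motion.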