This paper addresses the challenging task of generating human motions in 3D indoor scenes from textual descriptions. The authors propose a novel approach that decomposes the complex problem into two manageable sub-problems: language grounding of the target object and object-centric motion generation. They leverage large language models (LLMs) for language grounding, specifically prompting ChatGPT to analyze the relationship between the scene description and the input instruction and to answer the resulting 3D visual grounding query. For motion generation, they design an object-centric scene representation that focuses on the target object, reducing scene complexity and facilitating the modeling of the relationship between human motions and the object. The method is evaluated on the HUMANISE dataset and shown to outperform baselines in motion quality, object grounding accuracy, and scene alignment. The authors also demonstrate that their approach generalizes to the PROX dataset without fine-tuning. The paper provides a comprehensive overview of the problem setup, method, and experimental results, highlighting the effectiveness of the approach in generating realistic and contextually accurate human motions.
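
To make the two-stage decomposition concrete, the following is a minimal Python sketch of the pipeline as summarized above. It is an illustration under assumptions, not the authors' implementation: the function names (query_llm, ground_target_object, object_centric_scene), the scene-object dictionary format, and the prompt template are all hypothetical, and the paper's actual prompts and data structures may differ.

```python
from typing import Dict, List


def query_llm(prompt: str) -> str:
    """Stub for a chat-LLM call. The paper uses ChatGPT; the exact API
    and prompt format are assumptions in this sketch."""
    raise NotImplementedError("plug in an LLM client here")


def ground_target_object(scene_objects: List[Dict], instruction: str) -> int:
    """Stage 1 (hypothetical): language grounding. Describe the scene's
    objects in text and ask the LLM which one the instruction refers to."""
    scene_desc = "\n".join(
        f"[{i}] {obj['category']} at {obj['center']}"
        for i, obj in enumerate(scene_objects)
    )
    prompt = (
        "Scene objects:\n"
        f"{scene_desc}\n"
        f"Instruction: {instruction}\n"
        "Reply with only the index of the object the instruction refers to."
    )
    return int(query_llm(prompt).strip())


def object_centric_scene(scene_objects: List[Dict], target_idx: int) -> List[Dict]:
    """Stage 2 preprocessing (hypothetical): re-center every object on the
    grounded target, so the motion model only has to reason about the local
    human-object relationship rather than the full scene."""
    tx, ty, tz = scene_objects[target_idx]["center"]
    return [
        {**obj, "center": (obj["center"][0] - tx,
                           obj["center"][1] - ty,
                           obj["center"][2] - tz)}
        for obj in scene_objects
    ]
```

A trained, scene-conditioned motion generator would then consume the re-centered representation together with the instruction to produce the motion sequence; that model is outside the scope of this sketch.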