This paper addresses the challenging task of generating human motions in 3D indoor scenes from textual descriptions. The authors propose a novel approach that decomposes the complex problem into two manageable sub-problems: language grounding of the target object and object-centric motion generation. They leverage large language models (LLMs) for language grounding, specifically prompting ChatGPT to analyze the relationship between the scene description and the input instruction and to answer the resulting 3D visual grounding query. For motion generation, they design an object-centric scene representation that focuses on the target object, reducing scene complexity and facilitating the modeling of the relationship between human motions and the object. The method is evaluated on the HUMANISE dataset and shown to outperform baselines in motion quality, object grounding accuracy, and scene alignment. The authors also demonstrate that their approach generalizes to the PROX dataset without fine-tuning. The paper provides a comprehensive overview of the problem setup, method, and experimental results, highlighting the effectiveness of the approach in generating realistic and contextually accurate human motions.
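
To make the two-stage decomposition concrete, the following is a minimal Python sketch of the pipeline as summarized above. It is an illustration under assumptions, not the authors' implementation: the function names (query_llm, ground_target_object, object_centric_scene), the scene-object dictionary format, and the prompt template are all hypothetical, and the paper's actual prompts and data structures may differ.

```python
from typing import Dict, List


def query_llm(prompt: str) -> str:
    """Stub for a chat-LLM call. The paper uses ChatGPT; the exact API
    and prompt format are assumptions in this sketch."""
    raise NotImplementedError("plug in an LLM client here")


def ground_target_object(scene_objects: List[Dict], instruction: str) -> int:
    """Stage 1 (hypothetical): language grounding. Describe the scene's
    objects in text and ask the LLM which one the instruction refers to."""
    scene_desc = "\n".join(
        f"[{i}] {obj['category']} at {obj['center']}"
        for i, obj in enumerate(scene_objects)
    )
    prompt = (
        "Scene objects:\n"
        f"{scene_desc}\n"
        f"Instruction: {instruction}\n"
        "Reply with only the index of the object the instruction refers to."
    )
    return int(query_llm(prompt).strip())


def object_centric_scene(scene_objects: List[Dict], target_idx: int) -> List[Dict]:
    """Stage 2 preprocessing (hypothetical): re-center every object on the
    grounded target, so the motion model only has to reason about the local
    human-object relationship rather than the full scene."""
    tx, ty, tz = scene_objects[target_idx]["center"]
    return [
        {**obj, "center": (obj["center"][0] - tx,
                           obj["center"][1] - ty,
                           obj["center"][2] - tz)}
        for obj in scene_objects
    ]
```

A trained, scene-conditioned motion generator would then consume the re-centered representation together with the instruction to produce the motion sequence; that model is outside the scope of this sketch.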