This paper presents a method for generating human motion in 3D scenes from text descriptions. The approach has two stages: first, a large language model (LLM) grounds the text description by identifying the target object in the scene; second, a generation module synthesizes human motion focused on that object. This decomposition addresses the multi-modal nature of text, scene, and motion, as well as the spatial reasoning the task demands. For language grounding, the LLM interprets the instruction and locates the target object. For motion generation, an object-centric scene representation restricts the input to the neighborhood of the target object, reducing scene complexity and making human-object interactions easier to model; the local scene is encoded volumetrically and the motion is synthesized with a diffusion model. Evaluated on the HUMANISE dataset, the method produces realistic motion that aligns with both the text descriptions and the scene context, improves motion quality over the baselines, and generalizes to other scenes, such as those in the PROX dataset.
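To make the two-stage pipeline concrete, the sketch below shows one way the stages could be wired together. It is not the paper's implementation: names such as query_llm, SceneObject, and crop_radius are hypothetical placeholders, and the prompt format and cropping radius are illustrative assumptions.

```python
# Minimal sketch of the two-stage pipeline (illustrative, not the paper's code).
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneObject:
    name: str           # semantic label, e.g. "sofa"
    center: np.ndarray  # (3,) object centroid in scene coordinates
    points: np.ndarray  # (N, 3) object point cloud

def locate_target_object(text: str, objects: list[SceneObject], query_llm) -> SceneObject:
    """Stage 1: ask an LLM which scene object the instruction refers to."""
    labels = [f"{i}: {o.name} at {o.center.round(2).tolist()}" for i, o in enumerate(objects)]
    prompt = (
        "Scene objects:\n" + "\n".join(labels) +
        f"\nInstruction: \"{text}\"\n"
        "Reply with the index of the object the person should interact with."
    )
    index = int(query_llm(prompt).strip())  # assumes the LLM replies with a bare index
    return objects[index]

def object_centric_scene(target: SceneObject, scene_points: np.ndarray,
                         crop_radius: float = 2.0) -> np.ndarray:
    """Stage 2 input: keep only geometry near the target and re-center on it,
    so the generator sees a small, canonicalized neighborhood instead of the full scene."""
    local = scene_points - target.center                         # translate target to origin
    keep = np.linalg.norm(local[:, :2], axis=1) < crop_radius    # horizontal crop around the object
    return local[keep]
```

The object-centric crop is the key simplification: the generator never has to reason about the whole room, only about the geometry that constrains the interaction with the target object.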
The results show that the method outperforms existing approaches in motion quality, scene consistency, and action-text alignment, and that it remains effective across a variety of scenarios, including new, unseen scenes. Ablation studies of the design choices confirm that the object-centric representation and the LLM-based localization each contribute significantly to motion quality. The generator is implemented as a diffusion model and is trained on the AMASS dataset to further improve motion quality. The paper also discusses limitations, including the short duration of the generated motions and the reliance on LLMs for object localization, and argues that the approach is a promising basis for future work on text-driven motion generation in 3D scenes.
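For completeness, the following is a generic, self-contained sketch of how a conditional diffusion sampler of this kind can consume the text and object-centric scene features. The denoiser network, feature tensors, noise schedule, and motion dimensionality are illustrative assumptions rather than the paper's actual configuration (for instance, the paper's model may predict the clean motion directly rather than the noise).

```python
# Generic conditional DDPM sampling loop (illustrative sketch, not the paper's code).
import torch

@torch.no_grad()
def sample_motion(denoiser, text_feat, scene_feat, num_frames=60, dim=75, steps=50):
    """Reverse diffusion: denoise a motion sequence of shape (1, num_frames, dim),
    conditioned on a text embedding and an object-centric scene feature."""
    betas = torch.linspace(1e-4, 0.02, steps)        # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, num_frames, dim)              # x_T ~ N(0, I)
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        eps = denoiser(x, t_batch, text_feat, scene_feat)         # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])           # posterior mean
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise                   # ancestral sampling step
    return x
```

The important point for this paper is only the conditioning: every denoising step sees both the text feature and the object-centric scene feature, so the sampled motion is steered jointly by the instruction and by the geometry around the target object.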