[slides and audio] Generating Human Interaction Motions in Scenes with Text Control

TeSMo is a method for generating realistic human-scene interactions controlled by text. Given a 3D scene, TeSMo generates scene-aware motions such as walking in free space and sitting on a chair. The method combines a text-to-motion diffusion model, pre-trained on a diverse motion dataset, with a scene-aware component that is fine-tuned using data augmented with detailed scene information.

The method decomposes the task into navigation and interaction components. The navigation model generates a root (pelvis) trajectory that reaches a goal pose near the interaction object. The interaction model generates full-body motion conditioned on a goal pelvis pose and a detailed 3D representation of the target object. Fine-tuning uses augmented data that re-targets interactions to a variety of object shapes.

The model produces realistic and diverse human-object interactions in different scenes with various object shapes, orientations, initial body positions, and poses, and it can generate diverse locomotion styles controlled by text. Extensive experiments show that TeSMo surpasses prior techniques in the plausibility of human-scene interactions, as well as in the realism and variety of the generated motions. The method achieves superior goal-reaching accuracy and avoids collisions, and in a user study 71.9% of participants preferred motions generated by TeSMo over those produced by DIMOS. The approach also introduces the Loco-3D-FRONT dataset, which contains realistic navigation motions placed in 3D scenes, and extends the SAMP dataset with additional objects and text annotations.
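As a rough illustration of the two-stage decomposition described above, the following Python sketch shows only the data flow: a navigation stage that outputs a pelvis trajectory ending at a goal pose, followed by an interaction stage conditioned on that goal pose and on the target object. Everything here is an assumption made for illustration. The names `PelvisPose`, `BodyFrame`, `navigation_stage`, `interaction_stage`, and `generate` are hypothetical, and linear interpolation stands in for the diffusion models, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3 = Tuple[float, float, float]


@dataclass(frozen=True)
class PelvisPose:
    """Hypothetical pelvis pose: floor-plane position plus heading in radians."""
    x: float
    y: float
    heading: float


@dataclass(frozen=True)
class BodyFrame:
    """Hypothetical full-body frame; `phase` is a stand-in for real joint data."""
    pelvis: PelvisPose
    phase: float


def navigation_stage(text: str, start: PelvisPose, goal: PelvisPose,
                     num_frames: int) -> List[PelvisPose]:
    """Placeholder for the navigation model: a pelvis trajectory ending at `goal`.

    Linear interpolation replaces the learned sampler, so `text` is unused here.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    trajectory = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        trajectory.append(PelvisPose(
            x=(1 - t) * start.x + t * goal.x,
            y=(1 - t) * start.y + t * goal.y,
            heading=(1 - t) * start.heading + t * goal.heading,
        ))
    return trajectory


def interaction_stage(text: str, goal: PelvisPose, object_points: List[Point3],
                      num_frames: int) -> List[BodyFrame]:
    """Placeholder for the interaction model.

    It takes the same conditioning inputs named in the summary (a goal pelvis
    pose and a 3D representation of the object) but only emits dummy frames.
    """
    if not object_points:
        raise ValueError("the interaction stage needs a target object")
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    return [BodyFrame(pelvis=goal, phase=i / (num_frames - 1))
            for i in range(num_frames)]


def generate(nav_text: str, act_text: str, start: PelvisPose, goal: PelvisPose,
             object_points: List[Point3], nav_frames: int = 30,
             act_frames: int = 20) -> Tuple[List[PelvisPose], List[BodyFrame]]:
    """Run navigation first, then start the interaction where navigation ended."""
    trajectory = navigation_stage(nav_text, start, goal, nav_frames)
    frames = interaction_stage(act_text, trajectory[-1], object_points, act_frames)
    return trajectory, frames


# Example call with made-up values.
start = PelvisPose(0.0, 0.0, 0.0)
goal = PelvisPose(2.0, 1.0, 1.57)
chair_points = [(2.0, 1.4, 0.45)]
trajectory, frames = generate("walk forward", "sit on the chair",
                              start, goal, chair_points)
```

Splitting the stages this way means the only thing passed between them is the goal pelvis pose, which mirrors the hand-off described in the summary.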
The method is evaluated on the Loco-3D-FRONT and SAMP datasets, showing that the generated motion is on par with state-of-the-art diffusion models while improving the plausibility and realism of interactions compared to prior work.
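The summary reports goal-reaching accuracy and collision avoidance without defining either measure. The Python sketch below shows one plausible way such quantities could be computed on a 2D floor plan. It is an assumption for illustration only, not the evaluation protocol used for TeSMo: `goal_reaching_error`, `collision_rate`, and the set-of-cells occupancy grid are all hypothetical.

```python
import math
from typing import Sequence, Set, Tuple

Vec2 = Tuple[float, float]
Cell = Tuple[int, int]


def goal_reaching_error(trajectory: Sequence[Vec2], goal: Vec2) -> float:
    """Euclidean distance between the final trajectory position and the goal."""
    last_x, last_y = trajectory[-1]
    return math.hypot(last_x - goal[0], last_y - goal[1])


def collision_rate(trajectory: Sequence[Vec2], occupied: Set[Cell],
                   cell_size: float = 1.0) -> float:
    """Fraction of frames whose position falls inside an occupied grid cell."""
    hits = sum(
        1 for x, y in trajectory
        if (math.floor(x / cell_size), math.floor(y / cell_size)) in occupied
    )
    return hits / len(trajectory)


# Example with made-up values: one of four frames crosses the occupied cell (1, 0).
path = [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5), (3.5, 0.5)]
rate = collision_rate(path, occupied={(1, 0)})
error = goal_reaching_error([(3.0, 4.0)], goal=(0.0, 0.0))
```

Lower values are better for both quantities under these definitions.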

16 Apr 2024 | Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, and Davis Rempe