Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction

2 Apr 2024 | Junuk Cha, Jiheon Kim, Jae Shin Yoon, Seungryul Baek
This paper introduces Text2HOI, a text-guided method for generating 3D hand-object interaction motion. It produces realistic and diverse hand-object interactions from text prompts alone, without requiring an object trajectory or an initial hand pose.

The approach is a three-stage framework: it first estimates text-guided, scale-variant contact maps, then generates hand-object motion with a Transformer-based diffusion model, and finally refines the interaction by accounting for hand-object contact and penetration. Generating realistic interactions directly from text is difficult because labeled data is scarce and interaction types and object categories are diverse, so the method decomposes the task into two subtasks: contact generation and motion generation. For contact generation, a VAE-based network takes a text prompt and an object mesh as input and predicts the probability of hand-object contact during the interaction. For motion generation, a Transformer-based diffusion model uses the contact map as a prior to produce physically plausible hand-object motion conditioned on the text prompt. A hand refiner module then minimizes the distance between the hand joints and the object surface, improving the temporal stability of hand-object contacts and suppressing penetration artifacts.

The method is evaluated on three datasets (H2O, GRAB, and ARCTIC), where it outperforms baseline methods in accuracy, diversity, and physical realism, and it is shown to generalize to unseen objects. The model and newly labeled data are made available for future research.
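The motion-generation stage described above can be illustrated with a toy conditional denoising loop. This is a minimal sketch, not the paper's actual model: `denoiser` stands in for the Transformer that predicts the noise at each diffusion step, conditioned on a text embedding and the contact-map prior; the schedule, dimensions, and `dummy_denoiser` are illustrative assumptions.

```python
import numpy as np

def sample_motion(denoiser, text_emb, contact_map, T=50, frames=16, dim=6, seed=0):
    """Toy DDPM-style reverse sampling loop for conditional motion generation.

    `denoiser(x, t, text_emb, contact_map)` is a hypothetical stand-in for the
    paper's Transformer-based denoiser; conditioning inputs are passed through
    to it at every step, which is how the contact map acts as a prior.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.normal(size=(frames, dim))          # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, text_emb, contact_map)   # predicted noise at step t
        # standard DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x

def dummy_denoiser(x, t, text_emb, contact_map):
    """Placeholder denoiser so the loop runs end to end."""
    return 0.1 * x
```

For example, `sample_motion(dummy_denoiser, np.zeros(4), np.zeros(8))` returns a `(16, 6)` array of per-frame motion parameters; in the real system each frame would encode hand and object poses.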
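The refinement stage, which pulls contact-labeled hand joints toward the object surface, can be sketched as follows. This is a simplified illustration under assumed inputs (joint positions, object vertices, and per-joint contact probabilities); the paper's refiner also handles penetration, which is omitted here.

```python
import numpy as np

def refine_hand_joints(joints, obj_verts, contact_prob, thresh=0.5, step=0.5):
    """Pull contact-labeled hand joints toward the nearest object vertex.

    joints:       (J, 3) hand joint positions
    obj_verts:    (V, 3) object mesh vertices (cheap proxy for the surface)
    contact_prob: (J,)   predicted per-joint contact probability
    Joints below `thresh` are left untouched; the rest move a fraction
    `step` of the way toward their nearest object vertex, shrinking the
    joint-to-surface distance and stabilizing contact over time.
    """
    refined = joints.copy()
    for j in range(len(joints)):
        if contact_prob[j] < thresh:
            continue  # joint not expected to touch the object
        dists = np.linalg.norm(obj_verts - joints[j], axis=1)
        nearest = obj_verts[np.argmin(dists)]
        refined[j] = joints[j] + step * (nearest - joints[j])
    return refined
```

Running this per frame after motion generation keeps touching joints near the surface without altering joints that the contact map marks as free.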