Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction

2 Apr 2024 | Junuk Cha, Jiheon Kim, Jae Shin Yoon, Seungryul Baek
This paper introduces Text2HOI, a text-guided method for generating 3D hand-object interaction motion. It produces realistic and diverse hand-object interactions from text prompts alone, without requiring an object trajectory or an initial hand pose.

The approach is a three-stage framework: it first estimates text-guided, scale-variant contact maps, then generates hand-object motion with a Transformer-based diffusion model, and finally refines the interaction by accounting for hand-object contact and penetration. Generating realistic interactions directly from text is difficult because labeled data is scarce and interaction types and object categories are diverse, so the method decomposes the task into two subtasks: contact generation and motion generation. For contact generation, a VAE-based network takes a text prompt and an object mesh as input and predicts the probability of hand-object contact during the interaction. For motion generation, a Transformer-based diffusion model uses the contact map as a prior to produce physically plausible hand-object motion conditioned on the text prompt. A hand refiner module then minimizes the distance between the hand joints and the object surface, improving the temporal stability of hand-object contacts and suppressing penetration artifacts.

The method is evaluated on three datasets (H2O, GRAB, and ARCTIC), where it outperforms baseline methods in accuracy, diversity, and physical realism, and it is shown to generalize to unseen objects. The model and newly labeled data are made available for future research.
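The motion-generation stage described above can be illustrated with a toy conditional denoising loop. This is a minimal sketch, not the paper's actual model: `denoiser` stands in for the Transformer that predicts the noise at each diffusion step, conditioned on a text embedding and the contact-map prior; the schedule, dimensions, and `dummy_denoiser` are illustrative assumptions.

```python
import numpy as np

def sample_motion(denoiser, text_emb, contact_map, T=50, frames=16, dim=6, seed=0):
    """Toy DDPM-style reverse sampling loop for conditional motion generation.

    `denoiser(x, t, text_emb, contact_map)` is a hypothetical stand-in for the
    paper's Transformer-based denoiser; conditioning inputs are passed through
    to it at every step, which is how the contact map acts as a prior.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.normal(size=(frames, dim))          # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, text_emb, contact_map)   # predicted noise at step t
        # standard DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x

def dummy_denoiser(x, t, text_emb, contact_map):
    """Placeholder denoiser so the loop runs end to end."""
    return 0.1 * x
```

For example, `sample_motion(dummy_denoiser, np.zeros(4), np.zeros(8))` returns a `(16, 6)` array of per-frame motion parameters; in the real system each frame would encode hand and object poses.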
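The refinement stage, which pulls contact-labeled hand joints toward the object surface, can be sketched as follows. This is a simplified illustration under assumed inputs (joint positions, object vertices, and per-joint contact probabilities); the paper's refiner also handles penetration, which is omitted here.

```python
import numpy as np

def refine_hand_joints(joints, obj_verts, contact_prob, thresh=0.5, step=0.5):
    """Pull contact-labeled hand joints toward the nearest object vertex.

    joints:       (J, 3) hand joint positions
    obj_verts:    (V, 3) object mesh vertices (cheap proxy for the surface)
    contact_prob: (J,)   predicted per-joint contact probability
    Joints below `thresh` are left untouched; the rest move a fraction
    `step` of the way toward their nearest object vertex, shrinking the
    joint-to-surface distance and stabilizing contact over time.
    """
    refined = joints.copy()
    for j in range(len(joints)):
        if contact_prob[j] < thresh:
            continue  # joint not expected to touch the object
        dists = np.linalg.norm(obj_verts - joints[j], axis=1)
        nearest = obj_verts[np.argmin(dists)]
        refined[j] = joints[j] + step * (nearest - joints[j])
    return refined
```

Running this per frame after motion generation keeps touching joints near the surface without altering joints that the contact map marks as free.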