DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions


26 Mar 2024 | SAMMY CHRISTEN, Meta and ETH Zurich, Switzerland; SHREYAS HAMPALI, Meta, Switzerland; FADIME SENER, Meta, Switzerland; EDOARDO REMELLI, Meta, Switzerland; TOMÁŠ HODAŇ, Meta, Switzerland; ERIC SAUSER, Meta, Switzerland; SHUGAO MA, Meta, USA; BUGRA TEKIN, Meta, Switzerland
DiffH₂O is a diffusion-based framework for synthesizing realistic hand-object interactions from textual descriptions. The method addresses the challenge of generating physically plausible and semantically meaningful hand-object motions, especially for unseen objects. Key contributions include:

1. **Decoupling grasping and interaction**: The task is split into a grasping stage and an interaction stage. The grasping stage generates only hand motions, while the interaction stage synthesizes both hand and object poses.
2. **Compact pose representation**: A compact representation tightly couples hand and object poses, reducing physical artifacts such as interpenetration.
3. **Controllability**: Two guidance schemes give control over the generated motions. Grasp guidance steers the diffusion model toward a single target grasp pose, while detailed textual guidance relies on comprehensive text descriptions added to the GRAB dataset, enabling fine-grained control (a minimal sketch of grasp guidance follows this list).

The evaluation shows that DiffH₂O outperforms baseline methods on physics-based metrics, motion diversity, and action recognition accuracy. The method is also practical: a hand pose estimate from an off-the-shelf estimator can guide the diffusion process, and multiple actions can be sampled in the interaction stage. The project page and videos are available at https://diffh2o.github.io/.
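The abstract describes grasp guidance as steering the diffusion sampler with a single target grasp pose. The sketch below illustrates the general idea using classifier-style guidance on the predicted clean motion: at each denoising step, the final frame of the predicted sequence is pulled toward the target grasp. The function name, denoiser interface, and pose dimensionality are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def sample_with_grasp_guidance(model, text_emb, grasp_pose, timesteps, guidance_scale=1.0):
    """Denoise a motion sequence while nudging its final frame toward a target grasp pose.

    model      -- denoising network predicting the clean motion x0 (assumed interface: model(x, t, text_emb))
    text_emb   -- embedding of the textual description
    grasp_pose -- target grasp pose for the last frame of the grasping stage, shape (1, pose_dim)
    """
    # Random initial motion: (batch, frames, pose_dim)
    x = torch.randn(1, 64, grasp_pose.shape[-1])
    for t in reversed(list(timesteps)):
        with torch.enable_grad():
            x = x.detach().requires_grad_(True)
            x0_pred = model(x, t, text_emb)  # predicted clean motion
            # Guidance loss: distance of the predicted final frame to the target grasp
            loss = torch.nn.functional.mse_loss(x0_pred[:, -1], grasp_pose)
            grad = torch.autograd.grad(loss, x)[0]
        # Shift the sample against the gradient before the next step.
        # (A full sampler would also re-noise x to level t-1 here; omitted for brevity.)
        x = x0_pred.detach() - guidance_scale * grad
    return x


# Example with a stand-in denoiser (a real model would be a trained network):
if __name__ == "__main__":
    dummy_model = lambda x, t, c: x * 0.9   # placeholder for illustration only
    text_emb = torch.zeros(1, 512)          # assumed text-embedding size
    grasp_pose = torch.zeros(1, 99)         # e.g. hand pose parameters (assumed size)
    motion = sample_with_grasp_guidance(dummy_model, text_emb, grasp_pose, range(50))
    print(motion.shape)  # (1, 64, 99)
```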