DiffH₂O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

26 Mar 2024 | SAMMY CHRISTEN, Meta and ETH Zurich, Switzerland; SHREYAS HAMPALI, Meta, Switzerland; FADIME SENER, Meta, Switzerland; EDOARDO REMELLI, Meta, Switzerland; TOMÁŠ HODAŇ, Meta, Switzerland; ERIC SAUSER, Meta, Switzerland; SHUGAO MA, Meta, USA; BUĞRA TEKIN, Meta, Switzerland
DiffH₂O is a diffusion-based framework that synthesizes realistic hand-object interactions, including two-handed ones, from natural-language descriptions. It generalizes to unseen objects and supports fine-grained control through detailed text.
The method decomposes the task into a grasping stage and an interaction stage, each with its own diffusion model. A compact representation tightly couples hand and object poses, and two guidance schemes, grasp guidance and detailed textual guidance, give further control over the motion. Grasp guidance steers the diffusion model with a single target grasp pose; detailed textual guidance uses comprehensive text descriptions from the GRAB dataset to enable fine-grained control. The two-stage approach and subsequence guidance improve generalization and controllability.
Evaluated on the GRAB and HO-3D datasets, DiffH₂O outperforms baseline approaches both quantitatively and qualitatively, with improved performance in physics-based metrics, motion diversity, and action recognition. The detailed textual annotations also make it robust to unseen text. Its practicality is demonstrated in two ways: a hand pose estimate from an off-the-shelf pose estimator can be used for guidance, and multiple diverse sequences can be generated from a single grasp reference. Integrating physics and improving inference efficiency remain directions for future research.
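The two-stage idea with guidance from known poses can be illustrated with a toy sketch. It is not the authors' implementation: the summary does not say how guidance is applied, so the sketch assumes a simple imputation scheme in which known frames are overwritten after every denoising step. The names `toy_denoiser`, `sample_with_imputation`, and `two_stage`, the sequence lengths, and the noise scale are all hypothetical, and the "denoiser" is a placeholder for a trained diffusion model.

```python
import numpy as np


def toy_denoiser(x, t):
    """Hypothetical stand-in for a trained diffusion model: shrinks x toward zero."""
    return 0.9 * x


def sample_with_imputation(num_frames, pose_dim, known, mask, steps=50, seed=0):
    """Toy reverse-diffusion loop that pins the masked frames to known poses.

    known: (num_frames, pose_dim) array holding the poses to enforce.
    mask:  (num_frames, 1) boolean array, True where a frame is known.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames, pose_dim))  # start from pure noise
    for t in reversed(range(steps)):
        x = toy_denoiser(x, t)
        if t > 0:
            x = x + 0.01 * rng.standard_normal(x.shape)  # re-inject a little noise
        x = np.where(mask, known, x)  # imputation: overwrite the known frames
    return x


def two_stage(target_grasp, grasp_len=8, total_len=24):
    """Assumed pipeline: a grasping stage, then an interaction stage."""
    pose_dim = target_grasp.shape[0]

    # Stage 1 (grasping): the final frame is pinned to the target grasp pose.
    known1 = np.zeros((grasp_len, pose_dim))
    known1[-1] = target_grasp
    mask1 = np.zeros((grasp_len, 1), dtype=bool)
    mask1[-1] = True
    grasp = sample_with_imputation(grasp_len, pose_dim, known1, mask1, seed=0)

    # Stage 2 (interaction): the grasp subsequence is pinned at the start.
    known2 = np.zeros((total_len, pose_dim))
    known2[:grasp_len] = grasp
    mask2 = np.zeros((total_len, 1), dtype=bool)
    mask2[:grasp_len] = True
    seq = sample_with_imputation(total_len, pose_dim, known2, mask2, seed=1)
    return grasp, seq


grasp, seq = two_stage(np.array([1.0, -2.0, 0.5]))
print(seq.shape)
```

In this sketch the second stage only has to fill in the frames after the grasp, which is one way to read the division of labor between the grasping and interaction models described above.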