Understanding THOR: Text to Human-Object Interaction Diffusion via Relation Intervention

This paper introduces THOR, a diffusion model for Text-to-Human-Object Interaction (Text2HOI) generation. THOR addresses the challenge of generating dynamic human-object interactions from textual descriptions, a task that involves complex human motion, diverse object shapes, and ambiguous object motion semantics. The model integrates a relation intervention mechanism that refines object motion based on human-object kinematic relations, strengthening the spatial-temporal relations between humans and objects. THOR is trained on the Text-BEHAVE dataset, which combines textual descriptions with the largest publicly available 3D HOI dataset. The model also introduces interaction losses at different motion granularities to improve the realism and consistency of the generated interactions. Quantitative and qualitative experiments show that THOR outperforms existing methods in generating realistic and diverse human-object interactions. User studies and ablation analyses further validate the model, showing that the combination of relation intervention and interaction losses leads to more accurate and visually coherent results. THOR thus provides a unified framework for text-guided human-object interaction generation.
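The summary above does not spell out how the interaction losses are computed. As a rough illustration only, one simple way to score human-object consistency is to compare the distances between human joints and object points in a generated sequence against those in a reference sequence. The function names, the distance-based formulation, and the toy data below are all assumptions made for this sketch; they are not THOR's actual losses or its relation intervention mechanism.

```python
import math


def relation_distances(joints, obj_points):
    """Per-frame distances between every human joint and every object point.

    joints:     list of T frames, each a list of J (x, y, z) tuples
    obj_points: list of T frames, each a list of P (x, y, z) tuples
    returns:    list of T frames, each a J x P nested list of distances
    """
    return [
        [[math.dist(j, p) for p in frame_obj] for j in frame_joints]
        for frame_joints, frame_obj in zip(joints, obj_points)
    ]


def relation_loss(pred_joints, pred_obj, ref_joints, ref_obj):
    """Mean squared error between predicted and reference distance maps.

    Hypothetical stand-in for an interaction loss: it is zero when the
    predicted human-object distances match the reference exactly.
    """
    pred = relation_distances(pred_joints, pred_obj)
    ref = relation_distances(ref_joints, ref_obj)
    sq_errors = [
        (d_pred - d_ref) ** 2
        for frame_pred, frame_ref in zip(pred, ref)
        for row_pred, row_ref in zip(frame_pred, frame_ref)
        for d_pred, d_ref in zip(row_pred, row_ref)
    ]
    return sum(sq_errors) / len(sq_errors)


# Toy data (made up): one frame, two joints, one object point.
ref_joints = [[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]
ref_obj = [[(1.0, 0.0, 0.0)]]
pred_obj_shifted = [[(2.0, 0.0, 0.0)]]  # object drifted away from the body

loss_same = relation_loss(ref_joints, ref_obj, ref_joints, ref_obj)
loss_shifted = relation_loss(ref_joints, pred_obj_shifted, ref_joints, ref_obj)
```

Because the loss depends only on joint-to-object distances, translating the human and the object together leaves it unchanged, while moving the object alone increases it. A real system would more likely compute this on batched tensors and combine it with other terms; the pure-Python version here only keeps the sketch dependency-free.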

THOR: Text to Human-Object Interaction Diffusion via Relation Intervention

17 Mar 2024 | Qianyang Wu, Ye Shi, Xiaoshui Huang, Jingyi Yu, Lan Xu, and Jingya Wang