THOR: Text to Human-Object Interaction Diffusion via Relation Intervention

17 Mar 2024 | Qianyang Wu, Ye Shi, Xiaoshui Huang, Jingyi Yu, Lan Xu, and Jingya Wang
This paper addresses the challenging task of generating dynamic Human-Object Interactions (HOI) from textual descriptions (Text2HOI). Unlike existing works that focus on limited body parts or static objects, it aims to handle variations in human motion, diverse object shapes, and the semantic vagueness of object motion. To achieve this, the authors propose THOR (Text-guided Human-Object Interaction Diffusion with Relation Intervention), a diffusion model equipped with a relation intervention mechanism. In each diffusion step, THOR first generates text-guided human and object motion and then uses human-object relations to intervene in the object motion, strengthening the spatial-temporal relations between humans and objects. The model also introduces interaction losses at different levels of motion granularity to improve the realism of the generated interactions. Additionally, the authors construct Text-BEHAVE, a dataset that integrates textual descriptions with the largest publicly available 3D HOI dataset. Both quantitative and qualitative experiments demonstrate the effectiveness of the proposed model in generating coherent and realistic human-object interactions.
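The summary does not spell out how the relation intervention is computed, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes, purely for illustration, that the human-object relation is a per-frame 3D offset from one human anchor joint (for example a hand) to the object, and that intervention means blending the directly predicted object position with the position implied by that anchor and offset. The function name `intervene_object_motion` and the `weight` parameter are hypothetical.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def intervene_object_motion(
    human_anchor: List[Vec3],
    object_pred: List[Vec3],
    relation_offset: List[Vec3],
    weight: float = 0.5,
) -> List[Vec3]:
    """Blend a predicted object trajectory with the trajectory implied by
    the human anchor joint plus a predicted human-to-object offset.

    Hypothetical illustration only: the offset-based relation and the
    linear blend are assumptions, not details taken from the paper.
    """
    if not (len(human_anchor) == len(object_pred) == len(relation_offset)):
        raise ValueError("all sequences must cover the same frames")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    blended: List[Vec3] = []
    for anchor, obj, offset in zip(human_anchor, object_pred, relation_offset):
        # Object position implied by the human motion and the relation.
        implied = tuple(a + o for a, o in zip(anchor, offset))
        # weight = 0 keeps the direct prediction, weight = 1 trusts the relation.
        blended.append(
            tuple((1.0 - weight) * p + weight * q for p, q in zip(obj, implied))
        )
    return blended


# Usage: two frames of a hand trajectory, a drifting object prediction,
# and a constant hand-to-object offset.
hand = [(0.0, 1.0, 0.0), (0.1, 1.0, 0.0)]
obj = [(0.5, 1.0, 0.0), (0.7, 1.0, 0.0)]
offset = [(0.2, 0.0, 0.0), (0.2, 0.0, 0.0)]
corrected = intervene_object_motion(hand, obj, offset, weight=0.5)
```

In this sketch the corrected object positions land halfway between the raw prediction and the position tied to the hand, which is one simple way a relation could pull object motion toward consistency with human motion inside each denoising step.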