HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

18 Mar 2024 | Mengqi Zhang1*, Yang Fu1*, Zheng Ding1, Zhuowen Tu1, Xiaolong Wang1, Sifei Liu2
The paper introduces HOIDiffusion, a conditional diffusion model for generating realistic and diverse 3D hand-object interaction (HOI) data. HOIDiffusion takes both a 3D geometric structure and a text description as inputs to synthesize high-quality HOI images, enabling controllable, realistic synthesis by disentangling geometry from appearance. The model builds on a pre-trained diffusion model and a small set of 3D human demonstrations: it first generates the 3D geometric structure (shape and pose) of the hand and object with a GrabNet model, then fine-tunes a Stable Diffusion model on these 3D structures and text descriptions to synthesize the corresponding RGB images. The method outperforms previous approaches in producing physically plausible interactions and in generalizing to unseen instances.
The generated data is used to train 6D object pose estimation, demonstrating its effectiveness in improving perception systems. The paper also reports experiments on video generation and downstream tasks, showing the model's ability to maintain consistency across frames and to enhance object pose estimation performance.
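The two-stage design described above (geometry first, then appearance) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all class and function names (`HOIGeometry`, `grabnet_stage`, `diffusion_stage`) are hypothetical placeholders standing in for the real GrabNet grasp generator and the fine-tuned Stable Diffusion model.

```python
from dataclasses import dataclass

@dataclass
class HOIGeometry:
    """Stage 1 output: the 3D hand-object structure (stand-in for hand pose
    parameters and an object geometry encoding)."""
    hand_pose: list
    object_shape: list

def grabnet_stage(object_shape):
    """Placeholder for GrabNet: predicts a plausible grasping hand pose
    for the given object geometry. The real model is a learned grasp
    generator; here it is stubbed with a fixed-size pose vector."""
    hand_pose = [0.0] * 16  # hypothetical hand joint parameters
    return HOIGeometry(hand_pose=hand_pose, object_shape=object_shape)

def diffusion_stage(geometry, text_prompt):
    """Placeholder for the fine-tuned Stable Diffusion model: conditions on
    the 3D structure plus a text prompt to produce an RGB image. The real
    system conditions on rendered geometry maps; here we only return a
    dict describing the synthesis request."""
    return {"condition": geometry, "prompt": text_prompt, "image": None}

# Geometry is generated independently of appearance, then the text prompt
# controls appearance at synthesis time -- the disentanglement the paper
# uses for controllable generation.
geom = grabnet_stage(object_shape=[1.0, 2.0, 3.0])
sample = diffusion_stage(geom, "a hand holding a red mug on a wooden table")
```

Because geometry and appearance enter at separate stages, the same grasp can be re-rendered under many text prompts, which is how the method produces diverse training data for downstream pose estimation.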