HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

18 Mar 2024 | Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, Xiaolong Wang
HOIDiffusion is a conditional diffusion model that generates realistic and diverse 3D hand-object interaction (HOI) data. It takes both 3D hand-object geometric structures and text descriptions as inputs for image synthesis, and by disentangling geometry from appearance it makes generation both controllable and realistic. Training builds on a diffusion model pre-trained on large-scale natural images, adapted with a small number of 3D human demonstrations.

The framework operates in two stages: it first synthesizes the 3D geometric structure of the hand and object, then trains a diffusion model conditioned on both this 3D structure and a text prompt to generate the corresponding RGB image. This design allows flexible control over geometry and appearance, and the model generalizes well to different text prompts. A minimal sketch of this conditioning scheme appears below.

Experiments show that HOIDiffusion outperforms previous methods in hand-object image synthesis, producing images with higher fidelity to real data and better alignment with appearance-controlling text. Because geometry and appearance are disentangled, styles can be transformed freely without distorting geometry, a property essential for data construction. The generated data is used to train 6D object pose estimators, improving downstream perception systems, and the approach also extends to video generation. Together, these results establish HOIDiffusion as a powerful tool for generating realistic 3D hand-object interaction data with controllable geometry and appearance.
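To make the two-stage conditioning concrete, here is a minimal PyTorch sketch: stage one's output is stood in for by stacked geometry maps (e.g., a rendered hand skeleton, object normals, and a segmentation mask), and stage two denoises an image conditioned on those maps plus a text embedding. The module names, channel counts, and the toy denoiser are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GeometryEncoder(nn.Module):
    """Encodes stacked geometry maps into a spatial conditioning feature."""
    def __init__(self, in_channels: int = 9, out_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )

    def forward(self, geom: torch.Tensor) -> torch.Tensor:
        return self.net(geom)

class ToyConditionalDenoiser(nn.Module):
    """Stand-in denoiser: predicts noise from a noisy image, a geometry
    feature injected spatially, and a global text embedding injected
    per-channel (geometry controls structure, text controls appearance)."""
    def __init__(self, img_channels: int = 3, width: int = 64, text_dim: int = 128):
        super().__init__()
        self.in_conv = nn.Conv2d(img_channels, width, 3, padding=1)
        self.text_proj = nn.Linear(text_dim, width)
        self.mid = nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.SiLU())
        self.out_conv = nn.Conv2d(width, img_channels, 3, padding=1)

    def forward(self, noisy, geom_feat, text_emb):
        h = self.in_conv(noisy) + geom_feat                  # spatial geometry condition
        h = h + self.text_proj(text_emb)[:, :, None, None]   # global appearance condition
        return self.out_conv(self.mid(h))

# One training step on dummy data.
geom = torch.rand(2, 9, 64, 64)    # stage-one renders: skeleton + normals + mask (assumed)
image = torch.rand(2, 3, 64, 64)   # ground-truth HOI image
text_emb = torch.rand(2, 128)      # e.g., a CLIP-style text embedding
noise = torch.randn_like(image)
noisy = image + noise              # real pipelines use a noise schedule here

encoder, denoiser = GeometryEncoder(), ToyConditionalDenoiser()
pred = denoiser(noisy, encoder(geom), text_emb)
loss = nn.functional.mse_loss(pred, noise)  # standard noise-prediction objective
loss.backward()
print(f"loss: {loss.item():.4f}")
```

Because the geometry maps enter as a spatial feature while the text enters as a global embedding, the two conditions can be varied independently at sampling time, which is the disentanglement the summary describes.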