G-HOP is a generative model that learns a 3D hand-object prior for interaction reconstruction and grasp synthesis. It uses a denoising diffusion process, conditioned on the object category via a text prompt and built on a 3D UNet backbone, to generate plausible hand-object interactions. The hand is represented as a skeletal distance field spatially aligned with the object's signed distance field, which lets a single model generate hand and object jointly. Training aggregates seven diverse real-world interaction datasets spanning 155 categories, so the learned prior covers a wide variety of interactions.

Because the model is diffusion-based, log-likelihood gradients can be computed, allowing the prior to guide inference across tasks such as reconstructing hand-object interaction clips from video and synthesizing human grasps. The same likelihood also provides a way to rank generated grasps by plausibility. Evaluated on the HOI4D and 3DW datasets, G-HOP outperforms task-specific baselines in both video-based reconstruction and grasp synthesis, demonstrating that it generates realistic hand-object interactions and grasps.
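The skeletal distance field can be illustrated with a minimal sketch: for every query point on the grid, take the distance to the nearest hand-skeleton bone segment, while the object channel on the same grid would hold an ordinary signed distance field. The function name and bone encoding below are illustrative assumptions, not the paper's interface.

```python
import numpy as np

def skeletal_distance_field(grid_pts, joints, bones):
    """Distance from each query point to the nearest skeleton segment.

    grid_pts: (N, 3) query points on the shared 3D grid
    joints:   (J, 3) hand joint positions
    bones:    list of (parent, child) joint-index pairs
    (All names here are illustrative, not the paper's API.)
    """
    d = np.full(len(grid_pts), np.inf)
    for p_idx, c_idx in bones:
        a, b = joints[p_idx], joints[c_idx]
        ab = b - a
        # Project each point onto the bone segment and clamp to [0, 1].
        t = np.clip((grid_pts - a) @ ab / (ab @ ab), 0.0, 1.0)
        closest = a + t[:, None] * ab
        d = np.minimum(d, np.linalg.norm(grid_pts - closest, axis=1))
    return d
```

Stacking this hand channel with the object's signed distance field on one grid is what allows a single diffusion model to denoise both shapes jointly.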
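The guidance mechanism can be sketched in miniature: perturb the current estimate with noise, ask the denoiser for its noise prediction, and use the residual between predicted and injected noise as a score-style gradient on the estimate (as in score distillation). The toy denoiser, schedule values, and function names below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy variance schedule (hypothetical values, not the paper's).
alphas_bar = np.linspace(0.9999, 0.05, 1000)

def toy_denoiser(x_t, t, target):
    """Stand-in for the learned 3D UNet: it 'knows' the clean sample
    is `target`, so its noise prediction is exact."""
    a = np.sqrt(alphas_bar[t])
    s = np.sqrt(1.0 - alphas_bar[t])
    return (x_t - a * target) / s

def guidance_gradient(x, t, target):
    """One score-distillation-style guidance step: noise x at level t,
    then return (predicted noise - injected noise) as a gradient on x."""
    eps = rng.standard_normal(x.shape)
    a = np.sqrt(alphas_bar[t])
    s = np.sqrt(1.0 - alphas_bar[t])
    x_t = a * x + s * eps
    eps_pred = toy_denoiser(x_t, t, target)
    return eps_pred - eps

# Guided optimization: pull an interaction grid toward the prior's mode.
target = rng.standard_normal((4, 4, 4))   # stands in for a hand/object grid
x = np.zeros_like(target)
for step in range(200):
    t = int(rng.integers(100, 900))
    x -= 0.05 * guidance_gradient(x, t, target)
```

In a reconstruction or grasp-synthesis loop, this gradient would be combined with task losses (e.g. image evidence) so the prior steers the estimate toward plausible interactions.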