HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances


22 Apr 2024 | Supreeth Narasimhaswamy¹, Uttaran Bhattacharya², Xiang Chen², Ishita Dasgupta², Saayan Mitra², and Minh Hoai¹

¹Stony Brook University, USA · ²Adobe Research, USA
**Abstract**

Text-to-image generative models often struggle to produce realistic hands, with common issues including irregular poses, incorrect finger counts, and implausible orientations. To address this, we propose HanDiffuser, a diffusion-based architecture that generates realistic hands by injecting hand embeddings into the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model that generates SMPL-Body and MANO-Hand parameters from text prompts, and a Text-Guided Hand-Params-to-Image diffusion model that synthesizes images conditioned on the generated hand parameters and the text prompt. We incorporate multiple aspects of hand representation, including 3D shapes, joint-level finger positions, orientations, and articulations, to ensure robust learning and reliable performance. Extensive quantitative and qualitative experiments, along with user studies, demonstrate the effectiveness of HanDiffuser in generating images with high-quality, realistic hands.

**Introduction**

Text-to-image (T2I) generative models have advanced significantly and can now generate high-quality, photorealistic images. However, these models often struggle with realistic hand generation, producing hands with improbable poses, irregular shapes, incorrect finger counts, and poor interactions with objects. Generating high-quality hands is challenging because of their complex articulations and their interactions with other body parts. Existing hand representations based on keypoint skeletons and shape formats provide useful grounding but require integration into T2I pipelines. HanDiffuser addresses this by generating hand parameters from text prompts and then using these parameters to condition image generation, ensuring plausible hand poses, shapes, and finger articulations.

**Related Work**

We review related work on text-to-image generation, text-to-human generation, and hand representations. Text-to-image generation methods include GANs, autoregressive models, VQ-VAE transformers, and diffusion models. Text-to-human generation methods focus on pose and motion synthesis, while hand representations include keypoint skeletons, shape formats, and parametric models such as MANO.

**HanDiffuser Architecture**

HanDiffuser consists of two key components: Text-to-Hand-Params and Text-Guided Hand-Params-to-Image. The first component generates SMPL-Body and MANO-Hand parameters from text inputs; the second uses these parameters together with the text to generate images. We design a Text+Hand Encoder that captures hand pose, articulation, and shape, and conditions the image generation process on the combined embedding.
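This two-stage design lends itself to a compact sketch. The PyTorch code below is a minimal illustration, not the authors' implementation: the module names (`TextToHandParams`, `TextHandEncoder`), the MLP denoiser, the single pooled hand token, and the parameter dimensions are all our assumptions about one plausible realization of the architecture described above.

```python
import torch
import torch.nn as nn

# Assumed parameter layout: MANO describes each hand with a 3-dim global
# orientation, 45 finger-pose values (15 joints x 3 axis-angle), and 10 shape
# coefficients; SMPL adds 69 body-pose values and 10 body-shape coefficients.
HAND_PARAM_DIM = 3 + 45 + 10
BODY_PARAM_DIM = 69 + 10
PARAM_DIM = BODY_PARAM_DIM + 2 * HAND_PARAM_DIM  # body + left and right hands


class TextToHandParams(nn.Module):
    """Stage 1 (sketch): denoise SMPL/MANO parameter vectors given a pooled
    text embedding and the diffusion timestep."""

    def __init__(self, text_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PARAM_DIM + text_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, PARAM_DIM),
        )

    def forward(self, noisy_params, t, text_emb):
        # Predict the noise that was added to the parameter vector at step t.
        x = torch.cat([noisy_params, text_emb, t.float().unsqueeze(-1)], dim=-1)
        return self.net(x)


class TextHandEncoder(nn.Module):
    """Text+Hand Encoder (sketch): project hand parameters into the text-token
    space so the stage-2 image diffusion model can cross-attend to both."""

    def __init__(self, text_dim: int = 768):
        super().__init__()
        self.hand_proj = nn.Linear(PARAM_DIM, text_dim)

    def forward(self, text_tokens, hand_params):
        # text_tokens: (B, L, text_dim); hand_params: (B, PARAM_DIM)
        hand_token = self.hand_proj(hand_params).unsqueeze(1)
        return torch.cat([text_tokens, hand_token], dim=1)  # (B, L + 1, text_dim)


# Shape check with random tensors standing in for real embeddings.
stage1 = TextToHandParams()
params = torch.randn(2, PARAM_DIM)
eps_hat = stage1(params, torch.tensor([10, 10]), torch.randn(2, 768))
cond = TextHandEncoder()(torch.randn(2, 77, 768), params)
assert eps_hat.shape == (2, PARAM_DIM) and cond.shape == (2, 78, 768)
```

Per the paper's description, the real conditioning is richer than a single pooled token, since it encodes 3D shape, joint-level finger positions, orientations, and articulations; the design choice the sketch captures is that sampled hand parameters become an extra conditioning signal alongside the text.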
**Experiments**

We train HanDiffuser on curated datasets and evaluate it with metrics such as FID, KID, and hand detection confidence scores. User studies show that HanDiffuser outperforms baselines in generating realistic hands, with higher scores for hand plausibility.
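FID and KID have standard off-the-shelf implementations, so the automatic-metric side of this evaluation is easy to sketch. The snippet below uses torchmetrics with random uint8 tensors as stand-ins for real and generated images; the paper's exact protocol (sample counts, image resolution, and the specific hand detector behind the confidence scores) is not reproduced here.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# Placeholder batches: both metrics expect uint8 images of shape (N, 3, H, W)
# by default. In practice these would be reference photos and model samples,
# and far more than 100 of each for stable estimates.
real = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

kid = KernelInceptionDistance(subset_size=50)  # subset_size must not exceed N
kid.update(real, real=True)
kid.update(fake, real=False)
kid_mean, kid_std = kid.compute()
print(f"KID: {kid_mean.item():.4f} +/- {kid_std.item():.4f}")
```

The hand detection confidence score is complementary: one common recipe is to run a hand detector over the generated images and average its per-detection confidences, so malformed hands, which detectors fire on only weakly, drag the score down.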