Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

9 Apr 2024 | Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, Jianhuang Lai
This paper proposes Coarse-to-Fine Latent Diffusion (CFLD), a method for Pose-Guided Person Image Synthesis (PGPIS). Existing methods often fail to capture a high-level semantic understanding of person images, which leads to overfitting and poor generalization. CFLD addresses this by decoupling fine-grained appearance control and pose control at different stages, which improves generalization.

The architecture centers on two components that work together. A perception-refined decoder extracts a semantic understanding of the person image and supplies it as a coarse-grained prompt. A hybrid-granularity attention module encodes multi-scale fine-grained appearance features as bias terms, which enhances texture details. In addition, a lightweight pose adapter provides efficient structural guidance. The method is end-to-end and is trained purely on images, without relying on image-caption pairs or textual prompts.

On the DeepFashion benchmark, CFLD achieves state-of-the-art performance both quantitatively and qualitatively, outperforming existing methods in image quality, texture details, and pose alignment. It also performs strongly in user studies and ablation experiments. The method is robust to different resolutions and handles extreme or uncommon poses without overfitting. It is also effective for appearance editing and style transfer, preserving pose and appearance while changing the style of the generated images. The paper concludes that CFLD is a promising approach for PGPIS, with potential applications in domains such as film production, virtual reality, and fashion e-commerce.
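The idea of injecting fine-grained appearance features "as bias terms" into attention can be illustrated with a minimal sketch. The summary does not say where the bias is applied, so this example assumes it is added to the queries of a cross-attention layer whose keys and values come from the coarse-grained prompt. The function name `biased_cross_attention`, the single-head and single-scale setup, and the random weights are all hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_cross_attention(latent, prompt, appearance_bias, w_q, w_k, w_v):
    """Single-head cross-attention with an additive bias on the queries.

    latent:          (n, d) latent image tokens (the query side)
    prompt:          (m, d) coarse-grained prompt tokens (keys and values)
    appearance_bias: (n, d) fine-grained appearance features, assumed here
                     to be already resized to the latent resolution
    """
    # Assumption: the appearance features shift the queries before projection.
    q = (latent + appearance_bias) @ w_q
    k = prompt @ w_k
    v = prompt @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, m)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, m, d = 16, 4, 8                            # toy sizes, chosen arbitrarily
latent = rng.normal(size=(n, d))
prompt = rng.normal(size=(m, d))
bias = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = biased_cross_attention(latent, prompt, bias, w_q, w_k, w_v)
plain, _ = biased_cross_attention(latent, prompt, np.zeros((n, d)), w_q, w_k, w_v)
```

With a zero bias the layer reduces to ordinary cross-attention over the prompt, which is one way to read the coarse-to-fine split: the prompt carries the coarse semantics, and the bias only adjusts how each latent token attends to it.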