Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

9 Apr 2024 | Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, Jianhuang Lai
This paper proposes a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS). Existing PGPIS methods often overfit because they lack a high-level semantic understanding of the source image. To address this, the authors introduce a perception-refined decoder that extracts the semantics of a person image as a coarse-grained prompt, together with a hybrid-granularity attention module that encodes multi-scale fine-grained appearance features as bias terms augmenting that prompt. This design gives better control over both fine-grained appearance and pose during generation. The method is trained purely on images, without relying on image-caption pairs or textual prompts.

Extensive experiments on the DeepFashion benchmark show state-of-the-art performance both quantitatively and qualitatively, and the method also outperforms existing approaches on the Market-1501 dataset. The approach is end-to-end and efficient, producing realistic, natural textures and handling complex clothing and pose transitions without overfitting. Ablation studies confirm that the perception-refined decoder and the hybrid-granularity attention module are both crucial for high-quality results. Further evaluation with arbitrary target poses shows that the method consistently generates high-quality images while preserving the appearance of the source image, indicating that it is more robust and generalizable than existing approaches.
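The core conditioning idea above — a coarse-grained semantic prompt whose keys and values are augmented by fine-grained appearance features acting as bias terms inside cross-attention — can be sketched in a toy form. This is an illustrative NumPy mock-up under stated assumptions, not the authors' implementation: the function names, token counts, and embedding dimension are hypothetical, and in CFLD this attention lives inside a Stable Diffusion U-Net with learned projections at multiple scales.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_with_bias(query, coarse_prompt, fine_bias_k, fine_bias_v):
    """Cross-attention in which fine-grained appearance features enter as
    additive bias terms on the keys/values derived from the coarse-grained
    prompt (a hypothetical simplification of hybrid-granularity attention)."""
    k = coarse_prompt + fine_bias_k  # coarse semantics + fine-grained key bias
    v = coarse_prompt + fine_bias_v  # coarse semantics + fine-grained value bias
    scores = query @ k.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8                                   # toy embedding dimension
query = rng.normal(size=(16, d))        # U-Net latent tokens (assumed shape)
coarse = rng.normal(size=(4, d))        # coarse-grained prompt tokens
bias_k = rng.normal(size=(4, d)) * 0.1  # multi-scale appearance bias (keys)
bias_v = rng.normal(size=(4, d)) * 0.1  # multi-scale appearance bias (values)

out = cross_attention_with_bias(query, coarse, bias_k, bias_v)
print(out.shape)  # (16, 8)
```

Framing the fine-grained features as a small additive bias, rather than replacing the prompt outright, mirrors the paper's coarse-to-fine intuition: high-level semantics dominate the conditioning while appearance details refine it.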