Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On

1 Apr 2024 | Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, Xiangmin Xu
The paper introduces a Texture-Preserving Diffusion (TPD) model for high-fidelity virtual try-on, addressing the challenge of faithfully transferring garment textures in image-based virtual try-on. The TPD model improves the fidelity of synthesized images through two components:

1. **Self-Attention-based Texture Transfer (SATT)**: The masked person image and the reference garment image are concatenated along the spatial dimension and fed into the denoising UNet of the Stable Diffusion (SD) model. The self-attention blocks in the SD UNet then transfer textures from the garment to the person image efficiently, without requiring an additional image encoder (a sketch of this input construction follows this summary).
2. **Decoupled Mask Prediction (DMP)**: The model predicts an accurate inpainting mask for each person-garment pair, so that irrelevant textures are preserved while the original garment is removed. The mask is obtained by iteratively denoising it from random noise, conditioned on the reference garment and the original person image (see the second sketch below).

Evaluated on the VITON and VITON-HD databases, TPD outperforms state-of-the-art methods in terms of realism and coherence. It is also robust to challenging poses and complex textures, which makes it suitable for both garment-to-person and person-to-person try-on. The paper concludes with a discussion of limitations and broader impacts, highlighting potential real-world applications in online shopping and e-commerce.
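The SATT-style spatial concatenation can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' code: the function name `build_satt_input`, the tensor shapes, and the choice of stacking along the height axis are hypothetical, and the actual model operates on VAE latents inside the SD inpainting pipeline rather than on raw pixels.

```python
# Minimal sketch of a SATT-style input construction (illustrative only; the
# paper's exact preprocessing and resolutions may differ).
import torch

def build_satt_input(masked_person, garment, mask):
    """Concatenate the masked person image and the reference garment along the
    spatial (height) dimension so a single denoising UNet can attend across both.

    masked_person, garment: (B, 3, H, W) images
    mask: (B, 1, H, W), 1 where the original garment was removed
    Returns the combined image and the combined inpainting mask.
    """
    # Stack person on top of garment: (B, 3, 2H, W).
    combined_image = torch.cat([masked_person, garment], dim=2)
    # The garment half is fully known, so its mask region is all zeros.
    garment_mask = torch.zeros_like(mask)
    combined_mask = torch.cat([mask, garment_mask], dim=2)
    return combined_image, combined_mask

# Example with dummy tensors.
person = torch.randn(1, 3, 512, 384)
garment = torch.randn(1, 3, 512, 384)
mask = (torch.rand(1, 1, 512, 384) > 0.5).float()
image_in, mask_in = build_satt_input(person, garment, mask)
print(image_in.shape, mask_in.shape)  # (1, 3, 1024, 384) and (1, 1, 1024, 384)
```

Because the two images share one canvas, every self-attention layer in the UNet sees person and garment tokens in the same sequence, which is what allows textures to flow from the garment half to the inpainted person half without a separate garment encoder.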
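The DMP idea of denoising the inpainting mask itself can be sketched in the same spirit. The `MaskDenoiser` network and the generic DDPM sampling loop below are placeholders for illustration only; the paper's actual architecture, conditioning, and noise schedule may differ.

```python
# Hedged sketch of decoupled mask prediction: the mask is sampled by a
# denoising loop conditioned on the original person image and the garment.
import torch
import torch.nn as nn

class MaskDenoiser(nn.Module):
    """Placeholder epsilon-predictor standing in for the paper's denoising UNet
    (timestep conditioning omitted for brevity)."""
    def __init__(self, channels=7):  # noisy mask (1) + person (3) + garment (3)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, noisy_mask, person, garment, t):
        return self.net(torch.cat([noisy_mask, person, garment], dim=1))

@torch.no_grad()
def predict_mask(model, person, garment, steps=50):
    """Iteratively denoise a mask from random noise with a standard DDPM step."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(person.size(0), 1, person.size(2), person.size(3))
    for t in reversed(range(steps)):
        eps = model(x, person, garment, t)
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return (x > 0).float()  # binarize into an inpainting mask

# Dummy usage with an untrained placeholder network.
person = torch.randn(1, 3, 64, 48)
garment = torch.randn(1, 3, 64, 48)
mask = predict_mask(MaskDenoiser(), person, garment, steps=10)
```

The point of the sketch is the decoupling: the mask is produced by its own denoising process from the original person image and the reference garment, rather than being a fixed, garment-agnostic region, so areas unrelated to the new garment stay untouched.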