Improving Diffusion Models for Authentic Virtual Try-on in the Wild


29 Jul 2024 | Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Chun, Jinwoo Shin
This paper addresses the challenge of image-based virtual try-on, which aims to generate an image of a person wearing a specific garment. Previous methods, primarily based on Generative Adversarial Networks (GANs), often fail to preserve the fine details of the garment and struggle to generalize to different human images. To overcome these limitations, the authors propose *IDM-VTON*, an improved diffusion model that enhances both the fidelity and the authenticity of virtual try-on images.

IDM-VTON encodes the semantics of the garment image with two modules: a visual encoder and a parallel UNet. The visual encoder extracts high-level semantics, which are fused through the cross-attention layers, while the parallel UNet extracts low-level features, which are fused through the self-attention layers. In addition, detailed textual prompts for both the garment and person images are provided to enhance the authenticity of the generated results. The method also includes a customization feature that further improves performance when a pair of person-garment images is given.

Experimental results on the VITON-HD and DressCode datasets show that IDM-VTON outperforms previous methods in preserving garment details and generating authentic virtual try-on images. The proposed customization method also yields significant improvements in real-world scenarios, as evaluated on a collected In-the-Wild dataset featuring intricate garment patterns and diverse backgrounds. The paper concludes by discussing potential negative impacts and limitations of the technology, emphasizing the need for responsible use.
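To make the two fusion paths concrete, below is a minimal, self-contained sketch (not the authors' code) of how a single diffusion UNet block could combine the two garment conditioning signals: low-level features from a parallel garment UNet merged in self-attention, and high-level visual-encoder semantics merged in cross-attention alongside the text prompt. All module names, tensor shapes, and hyperparameters here are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn


class GarmentFusionBlock(nn.Module):
    """Toy transformer block with the two fusion paths described above:
    - self-attention over person tokens concatenated with low-level garment
      features from a parallel UNet branch, and
    - cross-attention against text-prompt tokens together with high-level
      garment semantics from a visual (image) encoder.
    """

    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, person_tokens, garment_lowlevel, garment_highlevel, text_tokens):
        # (1) Self-attention: concatenate low-level garment features so the
        #     person tokens can attend to them; keep only the person half of
        #     the sequence afterwards.
        n = person_tokens.shape[1]
        x = self.norm1(torch.cat([person_tokens, garment_lowlevel], dim=1))
        attn_out, _ = self.self_attn(x, x, x)
        h = person_tokens + attn_out[:, :n]

        # (2) Cross-attention: condition on the text tokens plus high-level
        #     garment semantics extracted by the visual encoder.
        ctx = torch.cat([text_tokens, garment_highlevel], dim=1)
        attn_out, _ = self.cross_attn(self.norm2(h), ctx, ctx)
        h = h + attn_out

        # Standard feed-forward sub-layer.
        return h + self.ff(self.norm3(h))


if __name__ == "__main__":
    b, dim = 2, 320
    block = GarmentFusionBlock(dim)
    person = torch.randn(b, 1024, dim)        # flattened person/latent features
    garment_low = torch.randn(b, 1024, dim)   # low-level features from the parallel UNet
    garment_high = torch.randn(b, 16, dim)    # image-encoder tokens for the garment
    text = torch.randn(b, 77, dim)            # text-prompt tokens
    out = block(person, garment_low, garment_high, text)
    print(out.shape)  # torch.Size([2, 1024, 320])
```

In this reading, the self-attention path carries fine texture and pattern detail from the garment image, while the cross-attention path injects its overall semantics; the customization step would then fine-tune such blocks on a specific person-garment pair.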