This paper presents IDM-VTON, a novel diffusion model for generating authentic virtual try-on images in real-world scenarios. The method improves garment fidelity and realism by fusing garment information into the base UNet through two attention modules: a cross-attention module that injects high-level garment semantics encoded by an image prompt adapter, and a self-attention module that injects low-level garment features extracted by a parallel garment UNet feature encoder. Detailed textual captions for both garment and person images further enhance the authenticity of the generated results. Additionally, a customization method is introduced to improve the model's performance in real-world scenarios.
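To make the dual-attention design concrete, below is a minimal PyTorch sketch of how the two fusion paths might be wired inside one attention block of the base UNet, assuming decoupled cross-attention for the adapter's garment-image tokens and self-attention over person latents concatenated with garment-UNet features. All module and tensor names (`TryonAttentionBlock`, `garment_feats`, `ip_tokens`, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TryonAttentionBlock(nn.Module):
    """Illustrative sketch (not the official code): one base-UNet block that
    fuses low-level garment features via self-attention and high-level
    garment semantics (image-prompt-adapter tokens) via cross-attention."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.txt_cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ip_cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(
        self,
        x: torch.Tensor,             # person latents: (B, N, dim)
        garment_feats: torch.Tensor, # low-level garment-UNet features: (B, M, dim)
        text_tokens: torch.Tensor,   # detailed caption embeddings: (B, T, dim)
        ip_tokens: torch.Tensor,     # garment tokens from the image prompt adapter: (B, K, dim)
    ) -> torch.Tensor:
        # Low-level path: concatenate garment features with the person latents
        # so self-attention can attend across both streams.
        h = self.norm1(x)
        kv = torch.cat([h, garment_feats], dim=1)
        h, _ = self.self_attn(h, kv, kv)
        x = x + h

        # High-level path: cross-attend to the text tokens and, separately,
        # to the adapter's garment-image tokens, then sum the two outputs.
        h = self.norm2(x)
        txt, _ = self.txt_cross_attn(h, text_tokens, text_tokens)
        img, _ = self.ip_cross_attn(h, ip_tokens, ip_tokens)
        return x + txt + img
```

In the actual method the garment encoder shares the base UNet's architecture so its intermediate features align with the denoising features at each resolution; a flat token sequence stands in for those features here.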
The model is trained on the VITON-HD dataset and evaluated on the VITON-HD and DressCode test sets. Results show that IDM-VTON outperforms previous methods, both qualitatively and quantitatively, in preserving garment details and generating authentic virtual try-on images. The model is further evaluated on an In-the-Wild dataset containing garments with intricate patterns and person images with diverse poses and gestures; IDM-VTON outperforms other methods on this dataset as well, particularly when customized, demonstrating the effectiveness of the proposed customization method in real-world scenarios.
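The customization step is only named in this summary. Below is a minimal sketch of one plausible form of it, assuming it amounts to briefly fine-tuning the try-on UNet on a single garment-person pair with the standard denoising objective, using diffusers-style scheduler and UNet calls; the helper `customize`, the `cond` bundle, and the choice of trainable parameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def customize(unet, scheduler, person_latent, cond, steps=100, lr=1e-5):
    """Hypothetical test-time customization: briefly fine-tune the try-on
    UNet on a single garment-person example before running inference.
    `cond` bundles the conditioning inputs (garment features, adapter
    tokens, caption embeddings); its exact contents are an assumption."""
    opt = torch.optim.AdamW(
        [p for p in unet.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        # Sample a random timestep and noise the latent of the person image.
        noise = torch.randn_like(person_latent)
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                          device=person_latent.device)
        noisy = scheduler.add_noise(person_latent, noise, t)
        # Standard epsilon-prediction diffusion loss on this single pair.
        pred = unet(noisy, t, **cond).sample
        loss = F.mse_loss(pred, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return unet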
The paper also presents ablation studies demonstrating that both the garment feature encoder and the detailed captions contribute to the model's performance. Across these comparisons, including against other diffusion-based methods, IDM-VTON significantly outperforms alternatives in preserving garment details and generating high-fidelity images, highlighting its potential for virtual try-on in the wild. The paper concludes that IDM-VTON is a promising approach for generating authentic virtual try-on images in real-world scenarios.