Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On
This paper proposes a Texture-Preserving Diffusion (TPD) model for high-fidelity virtual try-on that enhances the fidelity of try-on results without using additional image encoders. TPD utilizes the self-attention blocks within the diffusion model to achieve efficient and accurate texture transfer from the garment image to the person image. It also introduces a novel diffusion-based method for predicting an accurate inpainting mask from the person and reference garment images, further improving the reliability of the try-on results. These components are integrated into a single compact model that can synthesize high-fidelity try-on images with complex textures and patterns under challenging body pose variations.
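To make the texture-transfer idea concrete, the following is a minimal, hypothetical sketch (not the authors' code) of self-attention applied to a spatially concatenated person/garment latent, so that tokens from the garment half can attend to, and be attended by, tokens from the person half. The tensor shapes, the concatenation axis, and the JointSelfAttention module are illustrative assumptions.

```python
import torch
import torch.nn as nn


class JointSelfAttention(nn.Module):
    """Self-attention over tokens from both the person and garment halves."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N_person + N_garment, dim) tokens from the concatenated image
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # residual connection, as in standard transformer blocks


# Toy usage: concatenate two latent feature maps along the width and flatten
# them into one token sequence, so a single self-attention layer spans both images.
B, C, H, W = 1, 320, 32, 24
person = torch.randn(B, C, H, W)    # masked-person latent features (assumed shape)
garment = torch.randn(B, C, H, W)   # reference-garment latent features (assumed shape)
joint = torch.cat([person, garment], dim=-1)   # (B, C, H, 2W)
tokens = joint.flatten(2).transpose(1, 2)      # (B, H * 2W, C)
out = JointSelfAttention(C)(tokens)            # garment tokens can now inform
                                               # the person-region tokens
```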
The TPD model is evaluated on the VITON and VITON-HD databases, and it significantly outperforms state-of-the-art methods in terms of realism and coherence of the synthesized images. The model is also applied to various try-on tasks, including garment-to-person and person-to-person try-ons. The TPD model's key contributions include a novel diffusion-based and warping-free method for virtual try-on, an exploration of the effect of coarse inpainting masks on the fidelity of synthesized images, and a novel method for accurate mask prediction.
The TPD model is based on the Stable Diffusion (SD) model and uses the self-attention blocks in the denoising UNet to capture long-range correlations among pixels in the combined person-garment image. It also introduces a Decoupled Mask Prediction (DMP) method that automatically determines an accurate inpainting area for each person-garment image pair; by adapting the inpainting area to each reference garment, DMP preserves as much of the person's identity information as possible.
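The exact DMP architecture is not detailed in this summary, so the sketch below only illustrates the general idea of predicting a per-pixel inpainting mask from the person and garment inputs rather than relying on a fixed coarse mask. The convolutional MaskPredictionHead, its channel counts, and the latent shapes are assumptions; the paper's actual method is diffusion-based rather than a simple feed-forward head.

```python
import torch
import torch.nn as nn


class MaskPredictionHead(nn.Module):
    """Toy mask predictor: person + garment latents -> soft inpainting mask."""

    def __init__(self, in_channels: int = 8, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, person_latent: torch.Tensor,
                garment_latent: torch.Tensor) -> torch.Tensor:
        # Concatenate along channels and emit a soft mask in [0, 1] marking
        # the area the diffusion model should inpaint with the new garment.
        x = torch.cat([person_latent, garment_latent], dim=1)
        return torch.sigmoid(self.net(x))


# Toy usage with 4-channel, SD-style latents (shapes are assumptions).
person_latent = torch.randn(1, 4, 64, 48)
garment_latent = torch.randn(1, 4, 64, 48)
mask = MaskPredictionHead()(person_latent, garment_latent)   # (1, 1, 64, 48)
```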
The TPD model is evaluated on three virtual try-on benchmarks: VITON, VITON-HD, and DeepFashion. The results show that the TPD model consistently achieves the best performance on both the VITON and VITON-HD databases. The model also performs well on the DeepFashion database for the person-to-person virtual try-on task, which involves transferring the garment worn by one person onto another person's body.
The TPD model is compared with state-of-the-art methods in both paired and unpaired settings. In the paired setting, the person in the source image S wears the same garment as the reference garment image; in the unpaired setting, the reference garment differs from the one originally worn in S. Performance is measured with Structural Similarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Fréchet Inception Distance (FID). The results show that the TPD model achieves the best performance in terms of realism and fidelity of the synthesized images.
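As a reference point, these metrics can be computed with off-the-shelf implementations. The sketch below uses torchmetrics (assumed installed with its image extras) and random tensors as stand-ins for real and synthesized try-on images, so it illustrates the metric calls rather than the paper's exact evaluation protocol; FID is only meaningful over many images, and the tiny batch here is just for shape illustration.

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance

# Stand-ins for ground-truth and synthesized try-on images in [0, 1].
real = torch.rand(8, 3, 256, 192)
fake = torch.rand(8, 3, 256, 192)

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
fid = FrechetInceptionDistance(feature=2048, normalize=True)

# SSIM and LPIPS compare each synthesized image with its paired ground truth;
# FID compares the two image distributions (the usual unpaired-setting metric).
fid.update(real, real=True)
fid.update(fake, real=False)

print("SSIM :", ssim(fake, real).item())    # higher is better
print("LPIPS:", lpips(fake, real).item())   # lower is better
print("FID  :", fid.compute().item())       # lower is better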
The TPD model is also evaluated through an ablation study, which validates the effectiveness of each key component. The results show that both the self-attention-based texture transfer and the Decoupled Mask Prediction method contribute to the fidelity of the synthesized try-on images.