CatVTON is a lightweight and efficient virtual try-on diffusion model that achieves high-quality try-on results by simply concatenating the garment and person images along the spatial dimension. It eliminates the need for an additional image encoder or ReferenceNet, reducing the trainable parameters to 49.57M. The model uses a single UNet backbone, removes unnecessary modules such as the text encoder and cross-attention layers, and simplifies inference by dropping extra pre-processing steps. CatVTON outperforms state-of-the-art methods in both qualitative and quantitative evaluations on the VITON-HD and DressCode datasets, and it generalizes well to real-world scenarios, handling complex textures, occlusions, and in-the-wild conditions effectively.
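The core idea of spatial concatenation can be sketched in a few lines. This is a minimal illustration, not CatVTON's actual code: the tensor shapes and variable names are assumptions, but it shows how the two inputs become one tensor that a single UNet can process through its ordinary self-attention, with no extra image encoder or cross-attention.

```python
# Minimal sketch of spatial concatenation for try-on, assuming
# latent-space inputs of shape (batch, channels, height, width).
# Shapes are illustrative, not CatVTON's real configuration.
import torch

person = torch.randn(1, 4, 64, 48)   # hypothetical person latent
garment = torch.randn(1, 4, 64, 48)  # hypothetical garment latent

# Stack the two images along the spatial (height) dimension so one
# UNet sees both at once and can transfer garment details to the
# person region via its existing self-attention layers.
x = torch.cat([person, garment], dim=2)

print(x.shape)  # a single (1, 4, 128, 48) input for the UNet
```

Because the concatenated tensor has the same channel count as a single image, the UNet needs no architectural changes beyond fine-tuning, which is consistent with the small 49.57M trainable-parameter count stated above.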