CatVTON is a lightweight and efficient virtual try-on diffusion model that achieves high-quality try-on results by simply concatenating the garment and person images along the spatial dimension. It eliminates the need for an additional image encoder or ReferenceNet, reducing the trainable parameters to 49.57M. The model uses a single UNet backbone, removes unnecessary modules such as the text encoder and cross-attention layers, and simplifies inference by dropping extra pre-processing steps. CatVTON outperforms state-of-the-art methods in both qualitative and quantitative evaluations on the VITON-HD and DressCode datasets, and it generalizes well to real-world scenarios, handling complex textures, occlusions, and in-the-wild conditions effectively.
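The core idea of spatial concatenation can be sketched in a few lines. This is a minimal illustration, not CatVTON's actual code: the tensor shapes and variable names are assumptions, but it shows how the two inputs become one tensor that a single UNet can process through its ordinary self-attention, with no extra image encoder or cross-attention.

```python
# Minimal sketch of spatial concatenation for try-on, assuming
# latent-space inputs of shape (batch, channels, height, width).
# Shapes are illustrative, not CatVTON's real configuration.
import torch

person = torch.randn(1, 4, 64, 48)   # hypothetical person latent
garment = torch.randn(1, 4, 64, 48)  # hypothetical garment latent

# Stack the two images along the spatial (height) dimension so one
# UNet sees both at once and can transfer garment details to the
# person region via its existing self-attention layers.
x = torch.cat([person, garment], dim=2)

print(x.shape)  # a single (1, 4, 128, 48) input for the UNet
```

Because the concatenated tensor has the same channel count as a single image, the UNet needs no architectural changes beyond fine-tuning, which is consistent with the small 49.57M trainable-parameter count stated above.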