Multimodal Unsupervised Image-to-Image Translation


14 Aug 2018 | Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz
This paper introduces MUNIT, a framework for multimodal unsupervised image-to-image translation. The goal is to learn the conditional distribution of corresponding images in the target domain without seeing any paired examples. Existing unsupervised approaches assume a deterministic one-to-one mapping, which limits the diversity of their outputs.

MUNIT decomposes the image representation into a content code, which is domain-invariant, and a style code, which captures domain-specific properties. To translate an image, its content code is recombined with a random style code sampled from the target domain's style space, yielding diverse, multimodal outputs. Users can also control the style of a translation by providing an example style image. The framework is implemented in PyTorch and available at https://github.com/nvlabs/MUNIT.

The paper reviews related work on GANs, image-to-image translation, style transfer, and learning disentangled representations. A theoretical analysis shows that the framework matches latent distributions, induces a joint distribution over the two image domains, and enforces a weak form of cycle consistency. Experiments demonstrate that the framework models multimodal output distributions and produces high-quality translations.
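As a rough illustration of this encode-swap-decode idea, the PyTorch sketch below translates an image from domain 1 to domain 2 by pairing its content code with a randomly sampled target-domain style code, and then re-encodes the result to hint at the latent-reconstruction form of cycle consistency. The tiny ContentEncoder, StyleEncoder, and Decoder modules, the 8-dimensional style code, and the additive style conditioning are illustrative stand-ins rather than the paper's architecture (which uses AdaIN-based residual decoders); see the official repository for the actual implementation.

```python
# Minimal sketch of MUNIT-style translation; module names, layer sizes, and the
# additive style conditioning are assumptions for illustration only.
import torch
import torch.nn as nn

STYLE_DIM = 8  # assumed style-code size


class ContentEncoder(nn.Module):
    """Maps an image to a spatial, domain-invariant content code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class StyleEncoder(nn.Module):
    """Maps an image to a low-dimensional, domain-specific style code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, STYLE_DIM),
        )

    def forward(self, x):
        return self.net(x)


class Decoder(nn.Module):
    """Reconstructs an image from a content code, conditioned on a style code."""
    def __init__(self):
        super().__init__()
        # Stand-in for the AdaIN parameters predicted from the style code.
        self.style_to_bias = nn.Linear(STYLE_DIM, 128)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 7, stride=1, padding=3), nn.Tanh(),
        )

    def forward(self, content, style):
        bias = self.style_to_bias(style).unsqueeze(-1).unsqueeze(-1)
        return self.net(content + bias)


# Translation from domain 1 to domain 2: keep the content code of x1,
# swap in a style code sampled from the target domain's (Gaussian) style space.
enc_c1, enc_c2, enc_s2, dec2 = ContentEncoder(), ContentEncoder(), StyleEncoder(), Decoder()
x1 = torch.randn(1, 3, 64, 64)            # an image from domain 1
c1 = enc_c1(x1)                           # domain-invariant content
s2 = torch.randn(1, STYLE_DIM)            # random target style -> multimodal outputs
x12 = dec2(c1, s2)                        # translated image in domain 2

# Weak cycle consistency via latent reconstruction: re-encoding the translation
# should recover the content and style codes it was generated from.
c1_recon = enc_c2(x12)
s2_recon = enc_s2(x12)
```

Sampling different style codes for the same content code is what makes the output distribution multimodal; during training, reconstruction losses on the re-encoded content and style codes (together with image reconstruction and adversarial losses) supply the weak cycle consistency mentioned above.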
The model is evaluated with human preference studies, LPIPS distance, and the Inception Score, and outperforms the baselines on these metrics. The framework is applied to several domains, including edges ↔ shoes/handbags, animal image translation, and street scenes, where MUNIT produces images that are both diverse and realistic, surpassing existing methods in quality and diversity. It is also compared to existing style transfer methods and produces more faithful and realistic results. The paper concludes that MUNIT is a promising approach for multimodal unsupervised image-to-image translation.
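The LPIPS-based diversity score referenced here is, roughly, the average perceptual distance between pairs of translations generated from the same input. Below is a minimal sketch using the third-party lpips package; the number of samples and the random stand-in images are assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Sketch of an LPIPS diversity measure: average pairwise perceptual distance
# between several translations of one input (pip install lpips).
import itertools

import lpips
import torch

lpips_fn = lpips.LPIPS(net='alex')        # AlexNet-based perceptual distance

# Stand-ins for translations of one source image with different random styles;
# in practice these would come from the trained MUNIT decoder.
translations = [torch.rand(1, 3, 64, 64) * 2 - 1 for _ in range(5)]  # in [-1, 1]

with torch.no_grad():
    distances = [lpips_fn(a, b).item()
                 for a, b in itertools.combinations(translations, 2)]

diversity = sum(distances) / len(distances)
print(f"average pairwise LPIPS distance: {diversity:.4f}")
```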