Palette: Image-to-Image Diffusion Models

August 7–11, 2022 | Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David Fleet, Mohammad Norouzi
This paper presents Palette, a unified framework for image-to-image translation built on conditional diffusion models. The framework is evaluated on four challenging tasks: colorization, inpainting, uncropping, and JPEG restoration. Palette generates high-fidelity outputs and outperforms strong GAN and regression baselines on all four tasks without task-specific hyper-parameter tuning, architecture customization, or auxiliary loss functions.

The paper advocates a standardized, ImageNet-based evaluation protocol that combines human evaluation with automated sample-quality scores: FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against the original images. Under this protocol, a generalist, multi-task diffusion model performs as well as or better than task-specific specialists: a single Palette model trained jointly on colorization, inpainting, and JPEG restoration outperforms a task-specific JPEG restoration model and achieves competitive performance on the other tasks. Palette is also robust, producing realistic and coherent outputs even after repeated applications of uncropping.

Ablations examine the choice of L1 versus L2 loss in the denoising objective and highlight the importance of self-attention in the neural architecture; the results indicate that the L2 loss yields higher sample diversity than the L1 loss.
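As a sketch in standard conditional-diffusion notation (not quoted from the paper), the denoising objective discussed above can be written as follows, where x is the source image, y the target image, γ the noise level, f_θ the denoising network, and p selects the norm:

\[
\mathbb{E}_{(x,y)}\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\; \mathbb{E}_{\gamma}\;
\left\| f_\theta\!\left(x,\; \sqrt{\gamma}\, y + \sqrt{1-\gamma}\,\epsilon,\; \gamma\right) - \epsilon \right\|_p^p
\]

Setting p = 1 gives the L1 objective, while p = 2 gives the L2 objective that the paper finds produces more diverse samples.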
The paper concludes that Palette is a simple, general framework for image-to-image translation that achieves strong results on four challenging tasks.
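As a concrete illustration of the framework summarized above, the following is a minimal PyTorch-style sketch of one conditional denoising training step, assuming conditioning by channel-wise concatenation of the source image with the noisy target. The names (denoiser, palette_style_loss) and the uniform noise-level sampling are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a conditional denoising training step (assumed setup,
# not the authors' code). `denoiser` is any U-Net-like model that takes a
# 6-channel input (source image concatenated with the noisy target) and a
# per-example noise level, and predicts the added noise.
import torch
import torch.nn.functional as F

def palette_style_loss(denoiser, x_src, y_tgt, p=2):
    """x_src, y_tgt: (B, 3, H, W) source/target images in [-1, 1]."""
    b = y_tgt.shape[0]
    # Sample a noise level per example; uniform here for simplicity
    # (the actual noise schedule in the paper differs).
    gamma = torch.rand(b, device=y_tgt.device).view(b, 1, 1, 1)
    eps = torch.randn_like(y_tgt)
    # Noisy version of the target image.
    y_noisy = gamma.sqrt() * y_tgt + (1 - gamma).sqrt() * eps
    # Condition by channel-wise concatenation of source and noisy target.
    pred_eps = denoiser(torch.cat([x_src, y_noisy], dim=1), gamma.view(b))
    # p = 1 gives the L1 objective (lower sample diversity);
    # p = 2 gives the L2 objective (higher sample diversity).
    if p == 1:
        return F.l1_loss(pred_eps, eps)
    return F.mse_loss(pred_eps, eps)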