August 7–11, 2022, Vancouver, BC, Canada | Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David Fleet, Mohammad Norouzi
This paper introduces Palette, a unified framework for image-to-image translation based on conditional diffusion models. The authors evaluate Palette on four challenging tasks: colorization, inpainting, uncropping, and JPEG restoration. Despite requiring no task-specific hyper-parameter tuning, architecture customization, or auxiliary losses, Palette outperforms strong GAN and regression baselines. The study highlights the impact of L2 vs. L1 loss in the denoising diffusion objective on sample diversity and demonstrates the importance of self-attention in the neural architecture. A standardized evaluation protocol based on ImageNet is proposed, incorporating human evaluation and sample quality metrics (FID, Inception Score, Classification Accuracy, and Perceptual Distance). The paper also shows that a generalist, multi-task diffusion model performs as well as or better than task-specific specialist models. The authors advocate for a standardized evaluation protocol to advance image-to-image translation research.
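
To make the L2 vs. L1 comparison concrete, the following is a minimal sketch of a conditional denoising objective in which the norm is a parameter. It is not taken from the summary above: the noise-mixing rule (sqrt(gamma) * target + sqrt(1 - gamma) * noise) is assumed from the common denoising diffusion formulation, and the names `corrupt`, `denoising_loss`, `training_loss`, and `zero_model` are hypothetical. Images are flattened to plain lists of floats so the example needs only the standard library.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical model signature: (source image, noisy target, noise level) -> predicted noise
Model = Callable[[Sequence[float], Sequence[float], float], Vector]


def corrupt(target: Sequence[float], noise: Sequence[float], gamma: float) -> Vector:
    """Mix a clean target with noise at noise level gamma in [0, 1] (assumed rule)."""
    keep, add = math.sqrt(gamma), math.sqrt(1.0 - gamma)
    return [keep * y + add * e for y, e in zip(target, noise)]


def denoising_loss(predicted: Sequence[float], true_noise: Sequence[float], p: int) -> float:
    """Mean of |predicted - true|^p: p=1 is the L1 variant, p=2 the L2 variant."""
    if p not in (1, 2):
        raise ValueError("p must be 1 (L1) or 2 (L2)")
    diffs = [abs(a - b) ** p for a, b in zip(predicted, true_noise)]
    return sum(diffs) / len(diffs)


def training_loss(model: Model, source: Sequence[float], target: Sequence[float],
                  gamma: float, p: int, rng: random.Random) -> float:
    """One stochastic loss evaluation for a (source, target) pair."""
    noise = [rng.gauss(0.0, 1.0) for _ in target]  # standard Gaussian noise
    noisy_target = corrupt(target, noise, gamma)
    predicted = model(source, noisy_target, gamma)  # model sees the conditioning input
    return denoising_loss(predicted, noise, p)


def zero_model(source: Sequence[float], noisy_target: Sequence[float], gamma: float) -> Vector:
    """Stand-in for the neural network: always predicts zero noise."""
    return [0.0 for _ in noisy_target]


if __name__ == "__main__":
    rng = random.Random(0)
    grayscale = [0.2, 0.4, 0.6]  # toy conditioning input (e.g. for colorization)
    color = [0.1, 0.5, 0.9]      # toy target
    for p in (1, 2):
        print(p, training_loss(zero_model, grayscale, color, gamma=0.5, p=p, rng=rng))
```

Switching `p` between 1 and 2 is the only change needed to compare the two losses in this sketch; a real model and image tensors would replace `zero_model` and the toy lists.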