Cascaded Diffusion Models for High Fidelity Image Generation

17 Dec 2021 | Jonathan Ho*, Chitwan Saharia*, William Chan, David J. Fleet, Mohammad Norouzi, Tim Salimans
The paper introduces Cascaded Diffusion Models (CDMs), an approach to generating high-fidelity images on the class-conditional ImageNet benchmark. A CDM is a pipeline of diffusion models that generate images at increasing resolutions: a standard diffusion model produces samples at the lowest resolution, and one or more super-resolution diffusion models then upsample the image and add higher-resolution detail.

The key contribution is *conditioning augmentation*: applying data augmentation, such as Gaussian noise or Gaussian blurring, to the low-resolution conditioning inputs of the super-resolution models during training. This prevents each super-resolution stage from overfitting to the artifacts of the previous stage's outputs, curbing the compounding of errors along the cascade and improving sample quality. The authors find the technique crucial for achieving high-quality samples at the highest resolution.

Experimentally, CDMs reach FID scores of 1.48 at 64×64, 3.52 at 128×128, and 4.88 at 256×256, outperforming BigGAN-deep. They also achieve classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256×256, surpassing VQ-VAE-2. The paper further compares different forms of conditioning augmentation, including Gaussian noise and Gaussian blurring, and provides detailed hyperparameters and architectures for each model in the pipeline.
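To make the mechanism concrete, here is a minimal PyTorch sketch of Gaussian conditioning augmentation for one super-resolution stage: the upsampled low-resolution conditioning image is corrupted with a few steps of the forward diffusion process before being fed to the denoiser. The function name, the linear noise schedule, the truncation range, and the tensor shapes are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_conditioning_augmentation(z, s, alphas_cumprod):
    """Corrupt the conditioning image z with s steps of forward diffusion:
    z_s = sqrt(abar_s) * z + sqrt(1 - abar_s) * noise.
    Truncated augmentation samples s from a small range during training;
    at sampling time a fixed s is applied to the previous stage's output.
    (Hypothetical helper, not the authors' exact code.)"""
    abar = alphas_cumprod[s].view(-1, 1, 1, 1)   # cumulative signal level per example
    noise = torch.randn_like(z)
    return abar.sqrt() * z + (1.0 - abar).sqrt() * noise

# A standard linear beta schedule (an assumption here, not taken from the paper).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Hypothetical training step for a 64x64 -> 256x256 super-resolution stage:
# upsample the low-res image, augment it, then condition the denoiser on
# z_aug alongside the noisy high-res target.
x_lowres = torch.randn(8, 3, 64, 64)             # stand-in for a training batch
z = F.interpolate(x_lowres, size=(256, 256), mode="bilinear", align_corners=False)
s = torch.randint(0, T // 10, (8,))              # truncated range of augmentation steps
z_aug = gaussian_conditioning_augmentation(z, s, alphas_cumprod)
```

Because the same corruption is applied to the lower-resolution model's samples at generation time, the super-resolution model never sees a cleaner conditioning signal at test time than it saw during training, which is what blocks errors from compounding across the cascade.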