DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

2024 | Yilun Xu, Gabriele Corso, Tommi Jaakkola, Arash Vahdat, Karsten Kreis
**Overview:** Diffusion models (DMs) have revolutionized generative learning by encoding data into a simple Gaussian distribution. However, encoding complex, potentially multimodal data into a single continuous Gaussian distribution is a difficult learning problem. To address this, the authors propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff), which introduce complementary discrete latent variables to simplify the learning task. DisCo-Diff augments DMs with learnable discrete latents inferred by an encoder, and the DM and encoder are trained end-to-end. Because the approach does not rely on pre-trained networks, it is universally applicable. The discrete latents significantly reduce the curvature of the DM's generative ODE, simplifying the denoising task and reducing training loss, especially at large diffusion times. An autoregressive transformer models the distribution of the discrete latents; this stays tractable because the latents are few in number and use small codebooks. The authors validate DisCo-Diff on toy data, image synthesis tasks, and molecular docking, finding that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on class-conditioned ImageNet-64/128 with an ODE sampler.

**Contributions:**
1. Propose DisCo-Diff, a novel framework for combining discrete and continuous latent variables in DMs.
2. Extensively validate DisCo-Diff, significantly boosting model quality and achieving state-of-the-art performance on several image synthesis tasks.
3. Present detailed analyses, ablation studies, and architecture design studies to demonstrate the unique benefits of discrete latent variables.

**Background:** DMs use a forward diffusion process to encode data into a Gaussian prior distribution. The authors argue that directly encoding complex, multimodal data into a single Gaussian distribution is challenging. DisCo-Diff introduces discrete latents to capture global structure and simplify the denoising task.

**Architecture:** For image modeling, DisCo-Diff combines a ViT encoder, a U-Net denoiser, and a group hierarchical design for conditioning on the discrete latents. The encoder treats each discrete latent as a separate classification token, allowing each latent to capture global image characteristics. The denoiser is conditioned on the discrete latents through cross-attention layers, and the group hierarchical design enhances interpretability by feeding distinct latent groups into features at different resolutions.
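Below is a minimal, hypothetical PyTorch-style sketch of the end-to-end training step described above. The interfaces (`encoder`, `denoiser`, `sample_sigma`) and the straight-through Gumbel-softmax estimator used to backpropagate through the discrete latents are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def disco_diff_training_step(x, encoder, denoiser, sample_sigma):
    """One end-to-end training step: infer discrete latents from the clean data,
    then train the denoiser conditioned on them. Gradients reach both networks."""
    # 1) Encoder (e.g., a ViT with one classification token per latent) maps the
    #    clean image to logits over a small codebook for each discrete latent.
    logits = encoder(x)                               # (B, num_latents, codebook_size)

    # 2) Straight-through Gumbel-softmax (an assumed estimator): discrete one-hot
    #    in the forward pass, differentiable in the backward pass.
    z = F.gumbel_softmax(logits, tau=1.0, hard=True)  # (B, num_latents, codebook_size)

    # 3) Continuous diffusion part: sample a noise level and perturb the data.
    sigma = sample_sigma(x.shape[0]).view(-1, 1, 1, 1)
    x_noisy = x + sigma * torch.randn_like(x)

    # 4) Denoiser sees the noisy image, the noise level, and the discrete latents
    #    (injected, e.g., through cross-attention layers).
    x_pred = denoiser(x_noisy, sigma, z)

    # 5) Simple denoising loss; the discrete latents only need to carry the global
    #    information the denoiser cannot easily recover from x_noisy alone.
    return ((x_pred - x) ** 2).mean()
```

Because the encoder sees the clean data, it can offload exactly the global, multimodal structure that makes large-noise denoising hard, which is consistent with the reported loss reduction at large diffusion times.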
**Experiments:**
- **Image Synthesis:** DisCo-Diff achieves state-of-the-art FID scores on class-conditioned ImageNet-64/128 with an ODE sampler.
- **Molecular Docking:** DisCo-Diff also improves performance on molecular docking, demonstrating the framework's universality.

**Conclusions:** DisCo-Diff significantly boosts performance by simplifying the DM's denoising task with the help of auxiliary discrete latent variables. The approach does not rely on pre-trained encoder networks and is validated across toy data, image synthesis, and molecular docking.
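For reference, here is a minimal sketch of the two-stage sampling procedure implied by the summary: an autoregressive `prior` first samples the few discrete latents, then an ODE solver generates the sample conditioned on them. The interfaces and the plain Euler probability-flow solver are assumptions for illustration; the paper's actual sampler may differ.

```python
import torch

@torch.no_grad()
def disco_diff_sample(prior, denoiser, shape, num_latents=10, num_steps=40,
                      sigma_max=80.0, sigma_min=0.002):
    """Two-stage sampling: (1) autoregressively sample the discrete latents,
    (2) solve a probability-flow ODE conditioned on them."""
    B = shape[0]

    # Stage 1: the latent sequence is short and the codebooks are small, so the
    # autoregressive prior adds little sampling overhead.
    tokens = torch.empty(B, 0, dtype=torch.long)
    for _ in range(num_latents):
        logits = prior(tokens)                        # (B, codebook_size) logits for next latent
        next_tok = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)

    # Stage 2: Euler integration of the generative ODE, conditioned on the
    # sampled discrete latents at every step.
    sigmas = torch.linspace(sigma_max, sigma_min, num_steps)
    x = torch.randn(shape) * sigma_max
    for i in range(num_steps - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        denoised = denoiser(x, sigma, tokens)
        d = (x - denoised) / sigma                    # ODE drift under an EDM-style parameterization
        x = x + (sigma_next - sigma) * d              # Euler step toward lower noise
    return x
```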