2024 | Yilun Xu, Gabriele Corso, Tommi Jaakkola, Arash Vahdat, Karsten Kreis
DisCo-Diff is a framework that enhances continuous diffusion models (DMs) by augmenting them with a small number of discrete latent variables. The discrete latents are inferred by an encoder and trained end-to-end with the DM; they absorb global, multimodal structure of the data, which simplifies the denoising task, reduces the curvature of the DM's generative ODE, and improves model performance. In a second stage, the distribution of the discrete latents is modeled with an autoregressive transformer, a learning problem that stays easy because the latents are few and low-dimensional. The framework is universal in that it relies on no pre-trained networks, and its discrete latents capture global structure complementary to semantic conditioning such as class labels. Validated on image synthesis and molecular docking, DisCo-Diff consistently outperforms existing methods while adding only minimal inference overhead. Overall, it demonstrates the benefit of combining discrete and continuous latent variables in diffusion models, offering a flexible and effective approach to generative modeling.
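To make the first-stage training recipe concrete, here is a minimal, hypothetical PyTorch sketch of the core idea: an encoder infers discrete latents from the clean data, the denoiser is conditioned on those latents, and both are trained jointly on an ordinary denoising loss. All names, dimensions, and the simplified forward process are illustrative assumptions, not the paper's actual setup; the straight-through Gumbel-softmax is one standard way to keep a discrete bottleneck differentiable and merely stands in for whichever relaxation the authors use. The second-stage autoregressive transformer over the latents is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, K, V = 64, 4, 8  # toy sizes: data dim, number of discrete latents, codebook size


class Encoder(nn.Module):
    """Infers logits over K discrete latents (each taking V values) from clean data x0."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D, 128), nn.SiLU(), nn.Linear(128, K * V))

    def forward(self, x0):
        return self.net(x0).view(-1, K, V)  # (batch, K, V) logits


class Denoiser(nn.Module):
    """Predicts the noise from (x_t, t), conditioned on one-hot discrete latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D + 1 + K * V, 256), nn.SiLU(), nn.Linear(256, D)
        )

    def forward(self, xt, t, z_onehot):
        h = torch.cat([xt, t, z_onehot.flatten(1)], dim=-1)
        return self.net(h)


encoder, denoiser = Encoder(), Denoiser()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(denoiser.parameters()), lr=1e-4
)

x0 = torch.randn(16, D)        # stand-in for a training batch
t = torch.rand(16, 1)          # diffusion time in [0, 1]
noise = torch.randn_like(x0)
xt = x0 + t * noise            # simplified forward process, for illustration only

logits = encoder(x0)
# Straight-through Gumbel-softmax keeps the discrete bottleneck differentiable,
# so encoder and denoiser can be trained end-to-end on the denoising loss.
z = F.gumbel_softmax(logits, tau=1.0, hard=True)

loss = F.mse_loss(denoiser(xt, t, z), noise)  # standard noise-prediction objective
loss.backward()
opt.step()
```

At sampling time, one would first draw the discrete latents from the separately trained autoregressive prior and then run the DM's reverse process conditioned on them, which is why the discrete stage adds so little inference overhead.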