2022 | High-Resolution Image Synthesis with Latent Diffusion Models | Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
The paper introduces Latent Diffusion Models (LDMs), an approach to high-resolution image synthesis that retains the quality of diffusion models (DMs) while reducing their computational cost. DMs, which are built from a sequence of denoising autoencoders, achieve state-of-the-art results in image synthesis but are expensive to train and sample from. LDMs address this by training the diffusion model in the latent space of a pre-trained autoencoder, which lowers the dimensionality of the data and speeds up both training and inference. Operating in this compressed space strikes a better balance between complexity reduction and detail preservation, leading to improved visual fidelity.
The authors add cross-attention layers to the denoising network to enable flexible conditioning on inputs such as text or bounding boxes, making LDMs suitable for tasks like text-to-image synthesis, unconditional image generation, and super-resolution. LDMs achieve competitive performance on these tasks while requiring significantly less compute than pixel-based DMs. The paper's experiments further demonstrate their effectiveness on image inpainting and class-conditional image synthesis.
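The core idea — run the diffusion process on a compressed latent rather than on pixels — can be sketched with a toy example. This is a minimal NumPy illustration, not the paper's implementation: average pooling stands in for the learned KL/VQ-regularized autoencoder, and the spatial downsampling factor of 8 mirrors the f = 8 setting the paper commonly uses; the linear noise schedule and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, factor=8):
    # Stand-in "encoder": average-pool the image by `factor` along each
    # spatial axis. The real LDM autoencoder is a learned network; pooling
    # only illustrates the 8x spatial compression (f = 8).
    h, w, c = image.shape
    return image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def forward_diffuse(z0, t, betas):
    # Closed-form forward process on the latent:
    #   z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps, eps

image = rng.standard_normal((256, 256, 3))   # a 256x256 RGB "image"
z0 = encode(image)                           # latent: 32x32x3, 64x fewer positions
betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear noise schedule
zt, eps = forward_diffuse(z0, t=500, betas=betas)

print(z0.shape)  # (32, 32, 3) -- the denoiser now operates in this smaller space
```

The denoising U-Net would be trained to predict `eps` from `zt` and `t` entirely in latent space; the decoder maps the final denoised latent back to pixels only once, which is where the compute savings over pixel-based DMs come from.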
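The cross-attention conditioning can likewise be sketched in miniature: queries come from the flattened image latents, while keys and values come from the conditioning sequence (e.g. text embeddings). This is a toy single-head version with random matrices standing in for learned projection weights; the token counts and dimensions below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def cross_attention(latent_tokens, cond_tokens, d_k):
    # Toy single-head cross-attention. Random projections stand in for
    # learned weights W_q, W_k, W_v; scaling by sqrt(fan-in) keeps
    # activations at a reasonable magnitude.
    rng = np.random.default_rng(1)
    d_lat, d_cond = latent_tokens.shape[-1], cond_tokens.shape[-1]
    W_q = rng.standard_normal((d_lat, d_k)) / np.sqrt(d_lat)
    W_k = rng.standard_normal((d_cond, d_k)) / np.sqrt(d_cond)
    W_v = rng.standard_normal((d_cond, d_k)) / np.sqrt(d_cond)
    Q, K, V = latent_tokens @ W_q, cond_tokens @ W_k, cond_tokens @ W_v
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_latent, n_cond)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over cond tokens
    return weights @ V                                  # each latent attends to the condition

latents = np.random.default_rng(2).standard_normal((1024, 64))  # 32x32 latent, flattened
text = np.random.default_rng(3).standard_normal((77, 512))      # assumed text-token sequence
out = cross_attention(latents, text, d_k=64)

print(out.shape)  # (1024, 64): one conditioned feature per latent position
```

Because the conditioning enters only through the keys and values, the same mechanism handles any tokenizable input — text, bounding-box layouts, or semantic maps — which is what makes the conditioning "flexible" in the summary above.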