4 Jan 2024 | Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui
The paper "Improving Diffusion-Based Image Synthesis with Context Prediction" introduces CONPREDIFF, a novel approach to enhance diffusion-based image synthesis by explicitly predicting neighborhood context. The authors address the limitation of existing diffusion models, which primarily focus on point-wise reconstruction, often neglecting the preservation of local context and semantic distribution. To overcome this, CONPREDIFF incorporates a context decoder at the end of diffusion denoising blocks during training, enabling each pixel/feature to predict its multi-stride neighborhood context. This approach ensures that each predicted pixel/feature is better reconstructed by preserving its semantic connections with the surrounding context.
The proposed method is designed to be efficient and can be applied to both discrete and continuous diffusion backbones without introducing additional parameters in the inference stage. The neighborhood context is characterized as a probability distribution over multi-stride neighbors, and an optimal-transport loss based on Wasserstein distance is used to optimize the decoding process. Extensive experiments on unconditional image generation, text-to-image generation, and image inpainting tasks demonstrate that CONPREDIFF consistently outperforms previous methods, achieving state-of-the-art results on the MS-COCO dataset with a zero-shot FID score of 6.21.
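The paper optimizes the neighborhood decoding with an optimal-transport loss based on Wasserstein distance. A common way to approximate such a loss in practice is entropy-regularized Sinkhorn iteration, sketched below as a generic example rather than the authors' exact formulation; the function name, the uniform marginals, and the hyperparameters are assumptions. The cost matrix could be, for instance, pairwise squared distances between predicted and ground-truth neighbor features.

```python
import torch

def sinkhorn_wasserstein(cost: torch.Tensor, eps: float = 0.1, n_iters: int = 50):
    """Entropy-regularized approximation of the Wasserstein distance between two
    uniform distributions over neighborhood elements, given a pairwise cost matrix.
    A generic Sinkhorn sketch, not the paper's exact loss.

    cost: (B, N, M) pairwise costs between predicted and target neighbor features.
    Returns: (B,) approximate transport cost per sample.
    """
    B, N, M = cost.shape
    mu = torch.full((B, N), 1.0 / N, device=cost.device)   # source marginal
    nu = torch.full((B, M), 1.0 / M, device=cost.device)   # target marginal
    K = torch.exp(-cost / eps)                              # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                                # Sinkhorn updates
        v = nu / (torch.einsum('bnm,bn->bm', K, u) + 1e-9)
        u = mu / (torch.einsum('bnm,bm->bn', K, v) + 1e-9)
    plan = u.unsqueeze(2) * K * v.unsqueeze(1)              # transport plan
    return (plan * cost).sum(dim=(1, 2))
```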
Key contributions of the paper include:
1. CONPREDIFF, the first approach to improve diffusion-based image generation with explicit context prediction.
2. An efficient approach to decode large context using an optimal-transport loss based on Wasserstein distance.
3. Significant performance improvements over existing diffusion models, achieving new SOTA results in image generation tasks.
The paper also discusses the impact and efficiency of context prediction, showing that it significantly enhances the quality and diversity of generated images while maintaining computational efficiency.