Improving Diffusion-Based Image Synthesis with Context Prediction

2024-01-04 | Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zheming Cai, Wentao Zhang, Bin Cui, Zhilin Huang
This paper proposes CONPREDIFF, a method that improves diffusion-based image synthesis by incorporating context prediction. During training, a context decoder attached to the end of the diffusion denoising blocks explicitly forces each spatial point to predict its neighborhood context, so that each point can better reconstruct itself by preserving its semantic connections to that neighborhood. The approach generalizes to both discrete and continuous diffusion backbones and introduces no additional parameters at inference time.

CONPREDIFF is evaluated on three major visual tasks: unconditional image generation, text-to-image generation, and image inpainting. It consistently outperforms previous methods, setting a new state-of-the-art zero-shot FID of 6.21 for text-to-image generation on MS-COCO, and also improves performance on image inpainting and unconditional image generation.

The key contributions are: (1) CONPREDIFF, the first approach to improve diffusion-based image synthesis with explicit neighborhood context prediction; (2) an efficient way to decode large contexts using an optimal-transport loss based on the Wasserstein distance; and (3) results showing that CONPREDIFF substantially outperforms existing diffusion models and achieves new state-of-the-art image generation results, while remaining computationally efficient since no extra parameters are added at inference.
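As a rough illustration of the training-time mechanism described above, the sketch below attaches a context-prediction head to the feature map of the last denoising block and scores its neighborhood predictions with an entropy-regularized optimal-transport (Sinkhorn) loss. This is a minimal sketch in PyTorch under assumptions not stated in the summary: the names (ContextDecoder, unfold_neighborhood, sinkhorn_ot_loss), the 3x3 neighborhood size, the uniform neighborhood distribution, and the loss weight are illustrative choices, not the paper's exact implementation.

```python
# Illustrative sketch only -- not the authors' implementation. Assumes PyTorch and a
# 3x3 neighborhood; ContextDecoder, unfold_neighborhood, and sinkhorn_ot_loss are
# hypothetical names introduced here for clarity.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def unfold_neighborhood(feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Collect the k x k neighborhood features of every spatial location.

    feat: (B, C, H, W) -> (B, H*W, k*k, C)
    """
    b, c, h, w = feat.shape
    patches = F.unfold(feat, kernel_size=k, padding=k // 2)            # (B, C*k*k, H*W)
    return patches.view(b, c, k * k, h * w).permute(0, 3, 2, 1)        # (B, H*W, k*k, C)


class ContextDecoder(nn.Module):
    """Predicts each point's k x k neighborhood features from the point's own feature."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k = k
        # A 1x1 conv maps every point to k*k predicted neighbor features.
        self.proj = nn.Conv2d(channels, channels * k * k, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        pred = self.proj(feat)                                          # (B, C*k*k, H, W)
        return pred.view(b, c, self.k ** 2, h * w).permute(0, 3, 2, 1)  # (B, H*W, k*k, C)


def sinkhorn_ot_loss(pred: torch.Tensor, target: torch.Tensor,
                     eps: float = 0.1, iters: int = 20) -> torch.Tensor:
    """Entropy-regularized optimal transport between predicted and true neighborhoods.

    pred, target: (B, N, M, C), treating each neighborhood as a uniform distribution
    over its M = k*k feature vectors; returns the mean transport cost.
    """
    cost = torch.cdist(pred, target, p=2)                               # (B, N, M, M)
    m = cost.shape[-1]
    log_kernel = -cost / eps
    log_mu = torch.full(cost.shape[:-1], -math.log(m), device=cost.device)
    log_u = torch.zeros_like(log_mu)
    log_v = torch.zeros_like(log_mu)
    # Sinkhorn iterations in log space for numerical stability.
    for _ in range(iters):
        log_u = log_mu - torch.logsumexp(log_kernel + log_v.unsqueeze(-2), dim=-1)
        log_v = log_mu - torch.logsumexp(log_kernel + log_u.unsqueeze(-1), dim=-2)
    plan = torch.exp(log_u.unsqueeze(-1) + log_kernel + log_v.unsqueeze(-2))
    return (plan * cost).sum(dim=(-2, -1)).mean()


if __name__ == "__main__":
    # Toy usage: score neighborhood predictions for a random feature map and combine
    # the context loss with a stand-in denoising loss.
    feat = torch.randn(2, 64, 16, 16)                 # features from the last denoising block
    decoder = ContextDecoder(channels=64, k=3)
    pred_ctx = decoder(feat)                          # (2, 256, 9, 64)
    true_ctx = unfold_neighborhood(feat.detach())     # (2, 256, 9, 64)
    loss_ctx = sinkhorn_ot_loss(pred_ctx, true_ctx)
    loss_denoise = torch.tensor(0.0)                  # placeholder for the usual noise-prediction loss
    total_loss = loss_denoise + 0.1 * loss_ctx        # weighted sum; the weight is illustrative
    print(float(total_loss))
```

Because the ContextDecoder and the OT loss are used only to shape training, they can be dropped at sampling time, which is consistent with the paper's claim of adding no parameters at inference.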