21 Mar 2018 | Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, Thomas S. Huang
The paper presents a novel deep generative model for image inpainting that explicitly uses contextual attention to improve the quality of the filled-in regions. The method addresses a limitation of existing deep learning approaches, which often produce distorted structures or blurry textures because they cannot effectively borrow information from distant spatial locations. The model is a feed-forward, fully convolutional network that handles images with multiple holes at arbitrary locations and of variable sizes. The key contribution is the contextual attention layer, which uses features from known patches as convolutional filters to process the generated patches, encouraging coherence with the surrounding context. The network is trained with a combination of reconstruction losses and two Wasserstein GAN losses, one enforcing global image consistency and the other local patch consistency. Experiments on faces (CelebA, CelebA-HQ), textures (DTD), and natural images (ImageNet, Places2) show that the method produces higher-quality inpainting results than existing approaches, and ablation studies validate the contribution of each component of the model.
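The idea of the contextual attention layer, treating normalized patches from the known region as convolution filters, matching them against the generated region with a softmax, and reconstructing the hole from the matched patches, can be sketched as follows. This is a simplified, self-contained PyTorch re-implementation for illustration only (the authors released TensorFlow code); the function name, patch size, and the `softmax_scale` temperature are assumptions, and the paper's mask handling and attention propagation step are omitted.

```python
# Minimal sketch of a contextual-attention-style layer (assumed simplification,
# not the authors' implementation).
import torch
import torch.nn.functional as F

def contextual_attention(foreground, background, ksize=3, softmax_scale=10.0):
    """Fill `foreground` features by attending to `background` patches.

    foreground, background: tensors of shape (1, C, H, W).
    Returns a (1, C, H, W) tensor reconstructed from background patches.
    """
    # 1. Extract ksize x ksize patches from the background feature map:
    #    unfold -> (1, C*ksize*ksize, L), where L is the number of patches.
    patches = F.unfold(background, kernel_size=ksize, padding=ksize // 2)
    L = patches.shape[-1]
    C = background.shape[1]
    kernels = patches.transpose(1, 2).reshape(L, C, ksize, ksize)

    # 2. L2-normalize each patch and use it as a conv filter: the response at
    #    every foreground location is the cosine similarity to that patch.
    norm = kernels.flatten(1).norm(dim=1).clamp_min(1e-8).view(L, 1, 1, 1)
    scores = F.conv2d(foreground, kernels / norm, padding=ksize // 2)  # (1, L, H, W)

    # 3. Softmax over the L background patches gives the attention map.
    attn = F.softmax(scores * softmax_scale, dim=1)

    # 4. Reconstruct the foreground as an attention-weighted sum of the raw
    #    (un-normalized) background patches via transposed convolution.
    out = F.conv_transpose2d(attn, kernels, padding=ksize // 2)
    # Overlapping patches contribute roughly ksize*ksize times each; renormalize.
    return out / (ksize * ksize)

if __name__ == "__main__":
    fg = torch.randn(1, 64, 32, 32)   # "generated" features inside the hole
    bg = torch.randn(1, 64, 32, 32)   # features of the known surroundings
    print(contextual_attention(fg, bg).shape)  # torch.Size([1, 64, 32, 32])
```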
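The combined training objective, a reconstruction term plus global and local Wasserstein GAN terms, could look roughly like the sketch below, here written with a gradient-penalty critic as is standard for WGANs. The helper names, loss weights, and the use of a hole mask in place of a cropped local patch are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of the training losses: l1 reconstruction plus two Wasserstein
# critic losses (whole image and hole region). Names and weights are assumed.
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: push the critic's gradient norm toward 1 on
    interpolates between real and fake samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp).sum()
    grad, = torch.autograd.grad(score, interp, create_graph=True)
    return ((grad.flatten(1).norm(dim=1) - 1) ** 2).mean()

def critic_loss(critic, real, fake, gp_weight=10.0):
    """Wasserstein critic loss with gradient penalty for one critic."""
    return (critic(fake.detach()).mean() - critic(real).mean()
            + gp_weight * gradient_penalty(critic, real, fake.detach()))

def generator_loss(real, fake, mask, d_global, d_local,
                   l1_weight=1.0, adv_weight=0.001):
    """l1 reconstruction plus adversarial terms from both critics.
    `mask` is 1 inside the hole; masking stands in for cropping the local patch,
    and the paper's spatially discounted reconstruction weighting is omitted."""
    recon = (real - fake).abs().mean()
    adv = -d_global(fake).mean() - d_local(fake * mask).mean()
    return l1_weight * recon + adv_weight * adv
```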