AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

28 Nov 2017 | Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He
AttnGAN is an attentional generative adversarial network for fine-grained text-to-image generation. Rather than conditioning only on a global sentence vector, it performs multi-stage refinement, attending to the most relevant words in the text description when synthesizing details in different regions of the image.

The model has two components. The attentional generative network stacks generators that produce images at increasing resolutions; at each stage, an attention layer automatically selects word-level conditions for each image sub-region, so that different regions are drawn according to the words most relevant to them. The deep attentional multimodal similarity model (DAMSM) provides a fine-grained image-text matching loss for training the generator, measuring the similarity between a generated image and its description using both global sentence-level and fine-grained word-level information. (Minimal sketches of both components follow below.)

Measured by inception score, AttnGAN outperforms the previous state of the art by 14.14% on the CUB dataset and by 170.25% on the more challenging COCO dataset; R-precision is reported alongside inception score to evaluate how well generated images match their descriptions. Visualizations of the attention layers confirm that the network learns to pick out the right words for each image region. The attention mechanism also helps stabilize training and is crucial for generating high-resolution images with fine-grained detail, and AttnGAN generalizes to complex and novel scenarios rather than merely reproducing training examples.
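To make the generator's attention step concrete, the sketch below shows word-level attention in the spirit described above, written in PyTorch. All names and shapes (`word_level_attention`, `proj`, the batch/region/word sizes) are illustrative assumptions rather than the authors' released code: for each image sub-region, a softmax over the words yields attention weights, and the weighted sum of the projected word features becomes that region's word-context vector, which the next generator stage can consume together with the region features.

```python
import torch
import torch.nn.functional as F

def word_level_attention(region_feats, word_feats, proj):
    """Word-level attention in the spirit of AttnGAN (illustrative sketch).

    region_feats: (B, N, D_img) hidden features for N image sub-regions
    word_feats:   (B, T, D_txt) word features from the text encoder
    proj:         a linear map D_txt -> D_img into the image feature space
    """
    words = proj(word_feats)                                 # (B, T, D_img)
    # Dot-product similarity between every sub-region and every word.
    scores = torch.bmm(region_feats, words.transpose(1, 2))  # (B, N, T)
    # For each region, softmax over words: how much each word matters there.
    attn = F.softmax(scores, dim=-1)                         # (B, N, T)
    # Word-context vector per region: attention-weighted sum of word features.
    context = torch.bmm(attn, words)                         # (B, N, D_img)
    return context, attn

# Hypothetical usage: 2 images, 64 sub-regions, 18 words per caption.
proj = torch.nn.Linear(256, 128, bias=False)
ctx, attn = word_level_attention(torch.randn(2, 64, 128),
                                 torch.randn(2, 18, 256), proj)
```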
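Similarly, a minimal sketch of the fine-grained word-region matching score behind the DAMSM loss, under the same caveat that names and the exact normalization are simplified assumptions: each word attends over the image regions, its relevance is the cosine similarity with the resulting region-context vector, and a smoothed maximum over words gives a single image-caption score. During training, these scores are compared across a batch with a softmax so that each image is pushed to match its own caption best.

```python
import torch
import torch.nn.functional as F

def damsm_match_score(word_feats, region_feats, gamma1=5.0, gamma2=5.0):
    """Fine-grained image-caption match score (simplified DAMSM sketch).

    word_feats:   (T, D) word features of one caption
    region_feats: (N, D) region features of one image, in the same space
    gamma1, gamma2: smoothing factors for the two softened maxima
    """
    w = F.normalize(word_feats, dim=-1)
    r = F.normalize(region_feats, dim=-1)
    sim = w @ r.t()                           # (T, N) word-region cosines
    alpha = F.softmax(gamma1 * sim, dim=-1)   # each word attends over regions
    context = alpha @ r                       # (T, D) region-context per word
    rel = F.cosine_similarity(w, context, dim=-1)   # (T,) per-word relevance
    # Smooth maximum over words -> a single caption-image score.
    return torch.logsumexp(gamma2 * rel, dim=0) / gamma2
```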