GVA: guided visual attention approach for automatic image caption generation

29 January 2024 | Md. Bipul Hossen, Zhongfu Ye, Amr Abdussalam, Md. Imran Hossain
The paper introduces Guided Visual Attention (GVA), a novel approach for automatic image caption generation. The method aims to enhance caption quality by incorporating an additional attention mechanism that re-adjusts the attentional weights on the visual feature vectors: the first-level attention module serves as guidance, and its output is used to re-weight the attention weights, significantly improving the generated captions. The model employs an encoder-decoder architecture, with Faster R-CNN for feature extraction in the encoder and a visual-attention-based LSTM in the decoder. Extensive experiments show that GVA achieves average improvements of 2.4% on BLEU@1 and 13.24% on CIDEr for the MS-COCO dataset, and 4.6% on BLEU@1 and 12.48% on CIDEr for the Flickr30k dataset. The code for this approach is available on GitHub. The paper addresses limitations of existing attention mechanisms, such as 'attention drift', and provides a more precise and detailed caption generation method.
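
The sketch below illustrates the two-level attention idea described above: a first attention pass over the region features produces a context vector that then guides a second pass, which re-weights the attention over the same features before the result is passed to the LSTM decoder. This is a minimal PyTorch illustration, assuming additive (Bahdanau-style) attention; the module names, dimensions, and scoring functions are assumptions for illustration and are not taken from the authors' released code.

```python
# Minimal sketch of a guided (two-level) visual attention module.
# Assumes additive attention over Faster R-CNN region features;
# all names and dimensions are illustrative, not the authors' API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GuidedVisualAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        # First-level attention (standard additive attention).
        self.feat_proj1 = nn.Linear(feat_dim, attn_dim)
        self.hid_proj1 = nn.Linear(hidden_dim, attn_dim)
        self.score1 = nn.Linear(attn_dim, 1)
        # Second-level attention, guided by the first-level context.
        self.feat_proj2 = nn.Linear(feat_dim, attn_dim)
        self.guide_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj2 = nn.Linear(hidden_dim, attn_dim)
        self.score2 = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (batch, num_regions, feat_dim) region features from Faster R-CNN
        # hidden:  (batch, hidden_dim) current LSTM hidden state
        # First-level attention weights and context.
        e1 = self.score1(torch.tanh(
            self.feat_proj1(regions) + self.hid_proj1(hidden).unsqueeze(1)))
        alpha1 = F.softmax(e1, dim=1)                 # (batch, num_regions, 1)
        context1 = (alpha1 * regions).sum(dim=1)      # first-level context

        # Second-level attention: re-adjust the weights using the
        # first-level context as guidance.
        e2 = self.score2(torch.tanh(
            self.feat_proj2(regions)
            + self.guide_proj(context1).unsqueeze(1)
            + self.hid_proj2(hidden).unsqueeze(1)))
        alpha2 = F.softmax(e2, dim=1)                 # re-weighted attention
        context2 = (alpha2 * regions).sum(dim=1)      # guided context fed to the LSTM
        return context2, alpha2


# Usage example with dummy tensors.
if __name__ == "__main__":
    attn = GuidedVisualAttention(feat_dim=2048, hidden_dim=512, attn_dim=512)
    regions = torch.randn(2, 36, 2048)    # e.g. 36 detected regions per image
    hidden = torch.randn(2, 512)
    ctx, weights = attn(regions, hidden)
    print(ctx.shape, weights.shape)       # (2, 2048) and (2, 36, 1)
```

In this reading, the second softmax is conditioned on what the first pass already attended to, which is how the re-weighting is intended to counteract 'attention drift'; how exactly the guidance signal enters the second scoring function is a design choice of the paper and should be checked against the released code.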