12 Apr 2017 | Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua
The paper introduces SCA-CNN, a novel convolutional neural network that incorporates spatial and channel-wise attention mechanisms to enhance image captioning. Unlike traditional spatial attention models that only modulate the last conv-layer feature map, SCA-CNN dynamically adjusts multi-layer feature maps based on the sentence context, focusing on both spatial locations and channels. The authors argue that this approach better aligns with the dynamic nature of visual attention, which combines contextual fixations over time. SCA-CNN is evaluated on three benchmark datasets (Flickr8K, Flickr30K, and MSCOCO) and consistently outperforms state-of-the-art visual attention-based image captioning methods. The paper also discusses the effectiveness of channel-wise attention and multi-layer attention, providing ablation studies and comparisons with other attention models. Visualizations of the attention mechanisms are included to illustrate how SCA-CNN attends to specific regions and features in images.
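To make the idea of modulating a feature map over both channels and spatial locations concrete, the following is a minimal NumPy sketch for a single conv layer. It is an illustration under stated assumptions, not the authors' formulation: the function names, the mean pooling per channel, the single-layer tanh scoring, the parameter shapes, and the choice to apply channel-wise attention before spatial attention are all assumptions made for this example. The vector `h` stands in for the sentence context.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_attention(V, h, w_v, W_h):
    # V: (C, H, W) feature map, h: (d,) sentence context.
    # w_v: (C,) and W_h: (C, d) are hypothetical learned parameters.
    pooled = V.mean(axis=(1, 2))               # one descriptor per channel, (C,)
    scores = np.tanh(w_v * pooled + W_h @ h)   # (C,)
    return softmax(scores)                     # channel weights, sum to 1

def spatial_attention(V, h, w_v, w_h):
    # w_v: (C,) and w_h: (d,) are hypothetical learned parameters.
    C, H, W = V.shape
    flat = V.reshape(C, H * W)                 # (C, H*W)
    scores = np.tanh(w_v @ flat + w_h @ h)     # (H*W,), context adds a shared offset
    return softmax(scores).reshape(H, W)       # location weights, sum to 1

def sca_modulate(V, h, params):
    # Assumed ordering for this sketch: channel-wise first, then spatial.
    beta = channel_attention(V, h, params["c_v"], params["c_h"])
    V_c = V * beta[:, None, None]              # reweight channels
    alpha = spatial_attention(V_c, h, params["s_v"], params["s_h"])
    return V_c * alpha[None, :, :], beta, alpha  # reweight locations

# Usage with random inputs and random (untrained) parameters.
rng = np.random.default_rng(0)
C, H, W, d = 8, 4, 4, 16
V = rng.standard_normal((C, H, W))
h = rng.standard_normal(d)
params = {
    "c_v": rng.standard_normal(C),
    "c_h": rng.standard_normal((C, d)),
    "s_v": rng.standard_normal(C),
    "s_h": rng.standard_normal(d),
}
X, beta, alpha = sca_modulate(V, h, params)
print(X.shape, beta.shape, alpha.shape)
```

In a multi-layer setting, the same kind of modulation would be applied to the feature maps of several conv layers, with the modulated output of one layer feeding the next. The sketch covers only one layer and one decoding step.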