14 Mar 2018 | Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
This paper introduces a combined bottom-up and top-down attention mechanism for image captioning and visual question answering (VQA). The bottom-up mechanism, based on Faster R-CNN, proposes salient image regions and extracts a feature vector for each, while the top-down mechanism determines the attention weights over those features. Computing attention at the level of objects and other salient regions, rather than over a uniform grid, also makes the attention weights easier to interpret. The authors evaluate their model on the MSCOCO test server for image captioning and the VQA v2.0 test-standard server for VQA, outperforming previous methods on both tasks and demonstrating the broad applicability of the combined attention mechanism.
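To make the two stages concrete, here is a minimal PyTorch sketch of a top-down attention module operating over bottom-up region features. The names and dimensions (`region_feats` for the k Faster R-CNN region vectors, `query` for a top-down context such as an LSTM hidden state, a 2048-d feature size) are illustrative assumptions for this summary, not the authors' implementation; the scoring function here is a generic additive (tanh) attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Weights a set of bottom-up region features using a top-down query vector."""
    def __init__(self, feat_dim: int, query_dim: int, hidden_dim: int):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)    # project region features
        self.proj_query = nn.Linear(query_dim, hidden_dim)  # project top-down query
        self.score = nn.Linear(hidden_dim, 1)               # scalar attention logit per region

    def forward(self, region_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, k, feat_dim); query: (batch, query_dim)
        joint = torch.tanh(self.proj_feat(region_feats) + self.proj_query(query).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=-1)   # (batch, k) attention weights
        return (weights.unsqueeze(-1) * region_feats).sum(dim=1)    # attended image feature

# Example: 36 proposed regions with 2048-d features and a 512-d top-down query.
attn = TopDownAttention(feat_dim=2048, query_dim=512, hidden_dim=512)
attended = attn(torch.randn(2, 36, 2048), torch.randn(2, 512))
print(attended.shape)  # torch.Size([2, 2048])
```

The key point the sketch illustrates is the division of labor: the region proposals and their feature vectors come from a pretrained detector (bottom-up), while the task-specific network only learns how to weight them (top-down).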