26 Jan 2016 | Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola
This paper introduces Stacked Attention Networks (SANs) for image question answering (QA). SANs use a multi-layer attention mechanism to progressively reason through an image to answer natural language questions. The authors argue that image QA often requires multiple steps of reasoning, and thus develop a SAN that queries the image multiple times to infer the answer. Experiments on four image QA datasets show that the proposed SAN significantly outperforms previous state-of-the-art approaches. The visualization of the attention layers illustrates how SANs progressively focus on relevant visual clues to locate the answer. The main contributions of the work are the proposal of SANs, comprehensive evaluations on four image QA benchmarks, and a detailed analysis of the SAN's attention layers.
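The stacked attention idea can be sketched in a few lines: each layer scores the image regions against the current query vector, pools the regions by those scores, and adds the result back into the query before the next layer repeats the process. The sketch below is a minimal numpy illustration of that loop, not the authors' implementation; the weight names (`W_I`, `W_Q`, `w`), dimensions, and random parameters are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_layer(V, u, W_I, W_Q, w):
    """One attention hop.
    V: (m, d) image region features; u: (d,) current question/query vector.
    Returns the refined query and the attention distribution over regions."""
    h = np.tanh(V @ W_I + u @ W_Q)   # (m, k) joint image-question representation
    p = softmax(h @ w)               # (m,) attention weights over regions
    v_tilde = p @ V                  # (d,) attended visual summary
    return v_tilde + u, p            # refined query carries both modalities

# Toy dimensions: m regions, d feature size, k hidden size, K stacked layers
m, d, k, K = 5, 8, 16, 2
V = rng.normal(size=(m, d))
u = rng.normal(size=(d,))            # stand-in for the encoded question

# Independent (hypothetical) parameters per attention layer
params = [(rng.normal(size=(d, k)),
           rng.normal(size=(d, k)),
           rng.normal(size=(k,))) for _ in range(K)]

for W_I, W_Q, w in params:
    u, p = attention_layer(V, u, W_I, W_Q, w)
```

Stacking `K` such hops lets a later layer re-weight the regions conditioned on what earlier hops already attended to, which is the multi-step reasoning the summary describes.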