Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

EMNLP 2016, November 1-5, 2016 | Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach
Multimodal Compact Bilinear pooling (MCB) is proposed for visual question answering (VQA) and visual grounding. MCB fuses visual and textual representations by approximating their outer product: each vector is randomly projected with Count Sketch, and the projections are combined by convolution, computed efficiently as an element-wise product in the Fast Fourier Transform (FFT) domain. This yields a far more expressive interaction between modalities than element-wise product or concatenation, without the parameter cost of a full bilinear model.

For VQA, the architecture applies MCB twice: first to predict spatial attention over the visual features, and then to fuse the attended visual representation with the question representation. For multiple-choice question answering, an additional MCB layer relates each encoded candidate answer to the joint question-image space. For visual grounding, MCB fuses visual and phrase representations, improving phrase localization accuracy. Evaluated on multiple datasets, including VQA and Visual Genome, the models outperform non-bilinear pooling methods and surpass the state of the art on the Visual7W dataset and the VQA challenge. The authors release code for replication.
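To make the fusion step concrete, below is a minimal NumPy sketch of the count-sketch-plus-FFT pooling the summary describes. The d = 16000 output dimension follows the setting reported in the paper; the function names and per-call random hashes are illustrative assumptions, not the authors' released implementation (in a trained model, the hash and sign vectors are sampled once and held fixed).

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Project vector v into d dims via Count Sketch with hash h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * v)  # scatter-add signed entries into hashed buckets
    return y

def mcb(x, q, d=16000, seed=0):
    """Multimodal compact bilinear pooling of visual vector x and text vector q.

    Approximates the outer product of x and q by count-sketching each input
    to d dims and convolving the sketches, computed as an element-wise
    product in the FFT domain.
    """
    rng = np.random.default_rng(seed)
    # Fixed random hash (bucket index) and sign vector per modality;
    # sampled here for illustration, but held constant in a real model.
    hx, hq = rng.integers(0, d, x.size), rng.integers(0, d, q.size)
    sx, sq = rng.choice([-1.0, 1.0], x.size), rng.choice([-1.0, 1.0], q.size)
    psi_x = count_sketch(x, hx, sx, d)
    psi_q = count_sketch(q, hq, sq, d)
    # Circular convolution of the two sketches via FFT.
    return np.fft.irfft(np.fft.rfft(psi_x) * np.fft.rfft(psi_q), n=d)

# Example: fuse a 2048-d visual feature with a 2048-d question feature.
fused = mcb(np.random.randn(2048), np.random.randn(2048))
print(fused.shape)  # (16000,)
```

In the paper's pipeline, the fused vector is further passed through signed square-root and L2 normalization before the downstream classifier; those steps are omitted here for brevity.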