Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding


November 1-5, 2016 | Akira Fukui*1,2; Dong Huk Park*1; Daylen Yang*1; Anna Rohrbach*1,3; Trevor Darrell1; Marcus Rohrbach1
This paper introduces Multimodal Compact Bilinear Pooling (MCB) for visual question answering (VQA) and visual grounding tasks. MCB is proposed to efficiently and effectively combine visual and textual representations, addressing the limitations of traditional methods such as element-wise operations or concatenation. MCB leverages the outer product of vectors, allowing multiplicative interactions between all elements, which is more expressive than linear combination methods. The authors propose an efficient approximation of MCB using Count Sketch and the Fast Fourier Transform (FFT) to avoid explicitly computing the high-dimensional outer product. Extensive evaluations on VQA and visual grounding datasets show that MCB outperforms existing methods, achieving state-of-the-art results on the Visual7W dataset and the VQA challenge. The paper also discusses the benefits of attention maps and additional training data for VQA tasks.
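To make the approximation concrete, below is a minimal NumPy sketch of compact bilinear pooling: each modality's feature vector is projected with an independent Count Sketch, and the sketch of their outer product is obtained as a circular convolution, computed as an element-wise product in the FFT domain. The function names, vector sizes, output dimension `d`, and random seed are illustrative assumptions, not values or code from the paper.

```python
# A minimal sketch of Multimodal Compact Bilinear (MCB) pooling with NumPy.
# Approximates the outer product of two feature vectors via Count Sketch
# projections and an element-wise product in the FFT domain.
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x to d dimensions using hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)   # y[h[i]] += s[i] * x[i]
    return y

def mcb_pool(v, q, d=16000, seed=0):
    """Approximate the bilinear (outer-product) interaction of v and q."""
    rng = np.random.RandomState(seed)
    # Independent Count Sketch parameters for each modality.
    h_v = rng.randint(d, size=v.shape[0])
    s_v = rng.choice([-1, 1], size=v.shape[0])
    h_q = rng.randint(d, size=q.shape[0])
    s_q = rng.choice([-1, 1], size=q.shape[0])
    psi_v = count_sketch(v, h_v, s_v, d)
    psi_q = count_sketch(q, h_q, s_q, d)
    # Circular convolution of the two sketches equals the sketch of the
    # outer product; computed as an element-wise product in the FFT domain.
    return np.real(np.fft.ifft(np.fft.fft(psi_v) * np.fft.fft(psi_q)))

# Example: fuse a 2048-d visual feature with a 2048-d question embedding.
visual = np.random.randn(2048)
question = np.random.randn(2048)
fused = mcb_pool(visual, question)
print(fused.shape)   # (16000,)
```

This illustrates why the method is "compact": the fused representation has a fixed dimension `d` regardless of the input sizes, instead of the full outer product's quadratic size.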