19 Oct 2018 | Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang
Bilinear Attention Networks (BAN) are introduced to efficiently exploit visual and linguistic information in multimodal learning. Unlike co-attention networks, which compute a separate attention distribution for each modality, BAN builds bilinear attention maps over every pair of visual and textual channels, capturing their interactions directly. On top of these maps, BAN uses low-rank bilinear pooling to extract joint representations, together with a variant of multimodal residual networks (MRN) to combine multiple attention maps (glimpses) efficiently.

This design yields more effective joint representations of visual and textual information in tasks such as visual question answering (VQA) and visual grounding. BAN achieves state-of-the-art results on the VQA 2.0 and Flickr30k Entities datasets, improving both accuracy and inference speed over previous methods, and extensive quantitative and qualitative experiments confirm its advantage in multimodal reasoning.
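To make the mechanism concrete, here is a minimal PyTorch sketch of single-glimpse low-rank bilinear attention. All layer names, dimensions, and the ReLU nonlinearities are illustrative assumptions, not the authors' reference implementation; the softmax is taken over all token–object pairs, reflecting the paper's treatment of the bilinear map as a single attention distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankBilinearAttention(nn.Module):
    """Single-glimpse low-rank bilinear attention, a sketch of the BAN idea.

    Hypothetical layer names and shapes; not the authors' reference code.
    """
    def __init__(self, x_dim, y_dim, rank):
        super().__init__()
        self.U = nn.Linear(x_dim, rank)   # projects question tokens for the map
        self.V = nn.Linear(y_dim, rank)   # projects visual objects for the map
        self.p = nn.Linear(rank, 1)       # pooling vector producing attention logits
        self.U2 = nn.Linear(x_dim, rank)  # second projection pair used for the
        self.V2 = nn.Linear(y_dim, rank)  # attended joint representation

    def forward(self, X, Y):
        # X: (batch, n_tokens, x_dim), Y: (batch, n_objects, y_dim)
        x = torch.relu(self.U(X))         # (b, n, r)
        y = torch.relu(self.V(Y))         # (b, m, r)
        # Bilinear logits A_ij = p^T (x_i * y_j): broadcast the elementwise
        # product over all (token, object) pairs, then pool the rank dimension.
        logits = self.p(x.unsqueeze(2) * y.unsqueeze(1)).squeeze(-1)  # (b, n, m)
        # One softmax over the flattened map: a distribution over all pairs.
        A = F.softmax(logits.view(X.size(0), -1), dim=1).view_as(logits)
        # Joint feature f_k = sum_ij A_ij * (U2 x_i)_k * (V2 y_j)_k
        x2 = torch.relu(self.U2(X))       # (b, n, r)
        y2 = torch.relu(self.V2(Y))       # (b, m, r)
        f = torch.einsum('bnm,bnr,bmr->br', A, x2, y2)
        return f, A
```

A toy usage follows, including a simplified version of the MRN-style residual chaining of glimpses. The glimpse loop here sums projected joint features into a running question vector, which is a deliberate simplification of the paper's token-wise broadcast formulation; all dimensions are again hypothetical.

```python
import torch
import torch.nn as nn

# Toy usage with hypothetical dimensions (builds on the sketch above).
att = LowRankBilinearAttention(x_dim=300, y_dim=512, rank=256)
X = torch.randn(2, 14, 300)   # batch of 2, 14 question tokens
Y = torch.randn(2, 36, 512)   # 36 detected objects per image
f, A = att(X, Y)              # f: (2, 256), A: (2, 14, 36)

# MRN-style residual chaining of two glimpses: each glimpse's joint
# feature is projected back and added to a running representation.
glimpses = nn.ModuleList(LowRankBilinearAttention(300, 512, 256) for _ in range(2))
back = nn.ModuleList(nn.Linear(256, 300) for _ in range(2))
q = X.sum(dim=1)                  # initial pooled question vector, (b, 300)
for att_g, back_g in zip(glimpses, back):
    f_g, _ = att_g(X, Y)          # joint feature for this glimpse
    q = q + back_g(f_g)           # residual update, as in MRN
```

Because each glimpse contributes through a residual sum, adding glimpses refines rather than replaces the representation, which is how BAN can use several attention maps without a proportional growth in parameters per step.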