19 Oct 2018 | Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang
This paper introduces Bilinear Attention Networks (BAN), a multimodal learning method that exploits vision-language information through bilinear attention distributions. Earlier attention mechanisms such as co-attention attend to each modality separately and thus neglect interactions between individual visual and textual channels, which can lead to suboptimal performance. BAN instead considers bilinear interactions between every pair of input channels, integrating visual and textual information more effectively. The method uses low-rank bilinear pooling to extract joint representations and proposes a variant of multimodal residual networks (MRN) to exploit multiple bilinear attention maps efficiently. Evaluated on the VQA 2.0 and Flickr30k Entities datasets, BAN achieves state-of-the-art results on both visual question answering and visual grounding. It also offers fast inference and parameter efficiency, making it a promising approach for multimodal tasks.
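To make the idea concrete, the sketch below shows one way a single-glimpse bilinear attention map with low-rank bilinear pooling could be implemented in PyTorch. The class name, the ReLU activations, the single-glimpse setup, and the toy dimensions (36 detected objects with 2048-d features, 14 question tokens with 1024-d features) are illustrative assumptions, not the paper's exact configuration or the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankBilinearAttention(nn.Module):
    # Hypothetical single-glimpse bilinear attention layer in the spirit of BAN.
    def __init__(self, v_dim, q_dim, rank, out_dim):
        super().__init__()
        # Low-rank projections used to form the attention logits
        self.U = nn.Linear(v_dim, rank)   # projects visual object features
        self.V = nn.Linear(q_dim, rank)   # projects question token features
        self.p = nn.Linear(rank, 1)       # collapses the rank dimension to a scalar logit
        # Separate low-rank projections used for the attended joint representation
        self.U_out = nn.Linear(v_dim, out_dim)
        self.V_out = nn.Linear(q_dim, out_dim)

    def forward(self, v, q):
        # v: (B, N, v_dim) visual object features; q: (B, M, q_dim) token features
        B, N, _ = v.shape
        M = q.size(1)
        # Bilinear attention logits over every (object, token) pair via low-rank pooling
        hv = torch.relu(self.U(v))                                        # (B, N, rank)
        hq = torch.relu(self.V(q))                                        # (B, M, rank)
        logits = self.p(hv.unsqueeze(2) * hq.unsqueeze(1)).squeeze(-1)    # (B, N, M)
        attn = F.softmax(logits.view(B, -1), dim=-1).view(B, N, M)
        # Joint representation: attention-weighted sum of elementwise feature products
        fv = self.U_out(v)                                                # (B, N, out_dim)
        fq = self.V_out(q)                                                # (B, M, out_dim)
        joint = torch.einsum('bnm,bnk,bmk->bk', attn, fv, fq)             # (B, out_dim)
        return joint, attn

# Example usage with toy shapes
v = torch.randn(2, 36, 2048)   # 36 detected objects, 2048-d features
q = torch.randn(2, 14, 1024)   # 14 question tokens, 1024-d features
layer = LowRankBilinearAttention(2048, 1024, rank=512, out_dim=1024)
joint, attn = layer(v, q)
print(joint.shape, attn.shape)  # torch.Size([2, 1024]) torch.Size([2, 36, 14])
```

In the full model, several such attention maps (glimpses) would be combined through the MRN-style residual connections described in the paper; this sketch covers only the bilinear attention and pooling step.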