18 July 2024 | Xiaoyu Song, Dezhi Han, Chongqing Chen, Xiang Shen, Huafeng Wu
The paper introduces the Visual-Modified Attention Network (VMAN), a novel architecture designed to address the challenges of multimodal Vision-Language Tasks (VLT) such as Visual Question Answering (VQA) and Visual Grounding (VG). VMAN optimizes the attention mechanism in the Transformer by introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of image information. This unit refines query features, filters out noise, and strengthens dependency modeling and reasoning. Two modification approaches are proposed: one based on a weighted sum and one based on cross-attention. Extensive experiments on five benchmark datasets for VQA and VG show that VMAN achieves competitive performance, including 70.99% accuracy on VQA-v2 and a breakthrough 74.41% on RefCOCOg.
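The two-stage idea, refining visual queries against the text before the image features self-interact, can be sketched in plain Python. This is an illustrative toy, not the paper's implementation: the function names, the single-head scaled dot-product form, and the choice to use the text-modified queries directly in the image self-attention step are all assumptions for exposition.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def visual_modified_attention(img_feats, txt_feats):
    # Stage 1 (hypothetical sketch): cross-attend image queries to the
    # text features, establishing text-visual correspondence and
    # refining each visual query before any image self-interaction.
    modified_q = attention(img_feats, txt_feats, txt_feats)
    # Stage 2: self-attention over the image features, driven by the
    # refined (text-aware) queries rather than the raw visual queries.
    return attention(modified_q, img_feats, img_feats)
```

In this sketch the cross-attention step plays the role of the paper's visual-modified attention unit; the weighted-sum variant would instead blend the raw and text-attended queries before the self-attention step.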
The paper highlights the shortcomings of conventional Transformers in multimodal tasks and proposes VMAN as an end-to-end universal architecture, validated through extensive experiments.