VMAN: visual-modified attention network for multimodal paradigms


Xiaoyu Song · Dezhi Han · Chongqing Chen · Xiang Shen · Huafeng Wu
Accepted: 28 June 2024 / Published online: 18 July 2024
The paper introduces the Visual-Modified Attention Network (VMAN), a novel architecture designed to address the challenges of multimodal Vision-Language Tasks (VLT) such as Visual Question Answering (VQA) and Visual Grounding (VG). VMAN optimizes the attention mechanism of the Transformer by introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of the image information. This unit refines the query features, filters out noise, and strengthens dependency modeling and reasoning. Two modification strategies are proposed: one based on a weighted sum and one based on cross-attention. Extensive experiments on five benchmark datasets for VQA and VG show that VMAN achieves competitive performance, reaching 70.99% accuracy on VQA-v2 and 74.41% on RefCOCOg. The paper highlights the shortcomings of conventional Transformers in multimodal tasks and positions VMAN as an end-to-end universal architecture.
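To make the described mechanism concrete, the sketch below shows one way a visual-modified attention unit could be structured: text features first modify the visual query (via either a weighted sum or cross-attention), and only then do the image features self-interact. This is a minimal illustration assuming standard multi-head attention; the class name, dimensions, and fusion details are assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch of a visual-modified attention unit (illustrative only).
import torch
import torch.nn as nn


class VisualModifiedAttention(nn.Module):
    """Refines visual queries with text features before image self-attention."""

    def __init__(self, d_model=512, n_heads=8, mode="cross"):
        super().__init__()
        assert mode in ("cross", "weighted_sum")
        self.mode = mode
        if mode == "cross":
            # Cross-attention variant: visual tokens attend to text tokens.
            self.text_to_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        else:
            # Weighted-sum variant: pool text tokens with learned weights
            # and add the pooled vector to every visual token.
            self.pool = nn.Linear(d_model, 1)
        # Standard self-attention over the text-modified visual tokens.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, vis, txt):
        # vis: (B, N_v, d) image region/grid features
        # txt: (B, N_t, d) question or referring-expression features
        if self.mode == "cross":
            mod, _ = self.text_to_vis(query=vis, key=txt, value=txt)
        else:
            w = torch.softmax(self.pool(txt), dim=1)      # (B, N_t, 1)
            mod = (w * txt).sum(dim=1, keepdim=True)      # (B, 1, d)
            mod = mod.expand_as(vis)
        q = self.norm1(vis + mod)                         # text-modified query
        out, _ = self.self_attn(query=q, key=vis, value=vis)
        return self.norm2(vis + out)


# Example usage with random features.
vis = torch.randn(2, 36, 512)   # e.g., 36 region features per image
txt = torch.randn(2, 14, 512)   # e.g., 14 word features per question
unit = VisualModifiedAttention(mode="weighted_sum")
print(unit(vis, txt).shape)     # torch.Size([2, 36, 512])
```

The key design point mirrored here is the ordering: the query is conditioned on the text before the image self-attention step, so irrelevant regions carry less weight when visual dependencies are modeled.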