ADVERSARIAL ROBUSTNESS FOR VISUAL GROUNDING OF MULTIMODAL LARGE LANGUAGE MODELS

16 May 2024 | Kuofeng Gao¹, Yang Bai², Jiawang Bai¹, Yong Yang²†, Shu-Tao Xia¹˒³†
This paper studies the adversarial robustness of visual grounding in Multimodal Large Language Models (MLLMs). Visual grounding, a critical capability of MLLMs, involves recognizing and localizing objects in images based on textual prompts. The authors propose three adversarial attack paradigms: untargeted attacks, which degrade the accuracy of bounding-box predictions; exclusive targeted attacks, which force every object to be grounded to a single target bounding box; and permuted targeted attacks, which rearrange the predicted bounding boxes within an image. Extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the effectiveness of these attacks, highlighting the need for improved adversarial robustness in MLLMs. The study offers a new perspective on designing grounding-specific attacks and serves as a strong baseline for future research in this area.
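To make the first paradigm concrete, below is a minimal PGD-style sketch in PyTorch, assuming a differentiable grounding model that returns normalized box coordinates. `ToyGrounder`, `untargeted_pgd`, and all hyperparameters (e.g., ε = 8/255, 20 steps) are illustrative assumptions for this sketch, not the authors' implementation; the paper's attacks operate on the actual box outputs of the MLLMs evaluated on RefCOCO and related benchmarks.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an MLLM's grounding head: maps an image to
# normalized box coordinates (x1, y1, x2, y2). A real attack would instead
# query the box coordinates the MLLM emits for a referring expression.
class ToyGrounder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=4)
        self.fc = nn.Linear(8 * 56 * 56, 4)

    def forward(self, x):
        h = torch.relu(self.conv(x)).flatten(1)
        return torch.sigmoid(self.fc(h))  # normalized (x1, y1, x2, y2)

def untargeted_pgd(model, image, gt_box, eps=8 / 255, alpha=2 / 255, steps=20):
    """PGD that pushes the predicted box away from the ground truth
    (the untargeted paradigm). For the exclusive targeted variant,
    descend on the distance to a fixed target box instead."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = torch.abs(model(adv) - gt_box).sum()  # L1 box-regression loss
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # ascend to degrade grounding
            adv = image + (adv - image).clamp(-eps, eps)  # project into L-inf ball
            adv = adv.clamp(0, 1).detach()                # stay a valid image
    return adv

model = ToyGrounder().eval()
image = torch.rand(1, 3, 224, 224)
gt_box = torch.tensor([[0.2, 0.2, 0.6, 0.6]])
adv_image = untargeted_pgd(model, image, gt_box)
print("box error after attack:", torch.abs(model(adv_image) - gt_box).sum().item())
```

Flipping the update to gradient descent toward a fixed target box sketches the exclusive targeted paradigm, and descending toward a permutation of the image's ground-truth boxes sketches the permuted targeted one.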