Adversarial Robustness for Visual Grounding of Multimodal Large Language Models

16 May 2024 | Kuofeng Gao, Yang Bai, Jiawang Bai, Yong Yang, Shu-Tao Xia
This paper investigates the adversarial robustness of visual grounding in multimodal large language models (MLLMs). Visual grounding enables MLLMs to recognize and locate objects in images, generating bounding boxes for tasks such as referring expression comprehension (REC), yet the adversarial robustness of this capability has remained unexplored. To address this gap, the authors propose three adversarial attack paradigms: untargeted, exclusive targeted, and permuted targeted attacks. Untargeted attacks aim to degrade the accuracy of bounding box predictions; exclusive targeted attacks force the MLLM to generate the same attacker-chosen bounding box for every referred object; and permuted targeted attacks rearrange the predicted bounding boxes among the different objects in an image. Together, these attacks probe how vulnerable MLLMs are to adversarial perturbations in visual grounding tasks.

The authors conduct extensive experiments on three benchmark datasets (RefCOCO, RefCOCO+, and RefCOCOg) using the 7B version of MiniGPT-v2. The results show that the proposed attacks significantly degrade visual grounding performance, with the permuted targeted attack being the most challenging. These findings highlight the need for improved adversarial robustness of visual grounding in MLLMs.

The paper also reviews related work, including recent advances in MLLMs and existing adversarial attacks, emphasizing the importance of studying adversarial threats to visual grounding. The authors conclude that the proposed attacks offer a new perspective for evaluating the robustness of MLLMs and can serve as a baseline for future research in this area. An ethics statement notes that the experiments are conducted in a controlled environment and that the attacks are not endorsed for use in real-world scenarios.
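The summary above describes the attack goals but not how the perturbations are computed. As a rough, hypothetical sketch (not the paper's exact formulation), a standard way to implement such image-space attacks is projected gradient descent (PGD) under an L_inf budget, where the loss is the cross-entropy of the MLLM's generated bounding-box token sequence: maximized against the ground-truth boxes for the untargeted attack, or minimized toward an attacker-chosen box string for the exclusive or permuted targeted attacks. The `loss_fn` wrapper below is an assumed interface, not part of MiniGPT-v2's API.

```python
import torch

def pgd_attack(loss_fn, image, epsilon=8/255, alpha=1/255, steps=40, targeted=False):
    """Generic PGD sketch under an L_inf budget.

    loss_fn(image) -> scalar tensor: assumed to return the cross-entropy of the
    grounding model's output sequence (e.g., its bounding-box token string)
    against either the ground-truth boxes (untargeted) or an attacker-chosen
    box string (exclusive / permuted targeted). This wrapper is hypothetical;
    the paper's exact objective may differ.
    """
    # Random start inside the epsilon ball, kept within valid pixel range [0, 1].
    adv = image.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-epsilon, epsilon)).clamp(0, 1)

    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            if targeted:
                # Targeted: descend the loss toward the attacker-chosen box string.
                adv = adv - alpha * grad.sign()
            else:
                # Untargeted: ascend the loss on the ground-truth boxes.
                adv = adv + alpha * grad.sign()
            # Project back onto the L_inf ball around the clean image, then clip to [0, 1].
            adv = image + (adv - image).clamp(-epsilon, epsilon)
            adv = adv.clamp(0, 1).detach()
    return adv
```

Under these assumptions, the exclusive targeted variant would use a `loss_fn` that scores a single fixed box string for every referring expression, while the permuted targeted variant would score the image's ground-truth boxes reassigned to different objects.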