Adversarial Attacks on Multimodal Agents

18 Jun 2024 | Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
This paper presents adversarial attacks on multimodal agents, which use vision-enabled language models (VLMs) to perform complex tasks in real-world environments. The authors show that, although attacking these agents is harder than traditional attacks because the attacker has only limited access to and knowledge of the environment, the agents can still be manipulated: adversarial text strings guide a gradient-based perturbation of a single trigger image placed in the environment.

Two attacks are described. The captioner attack targets white-box captioners that augment the agent's VLM, while the CLIP attack targets a set of open-weight CLIP models and transfers to proprietary VLMs. The attacks are evaluated on VisualWebArena-Adv, a set of adversarial tasks built on the VisualWebArena environment. The captioner attack makes a captioner-augmented GPT-4V agent execute adversarial goals with a 75% success rate; the CLIP attack achieves 21% and 43% success rates when the captioner is removed or when GPT-4V generates its own captions, respectively. The attacks are also effective against agents built on other VLMs, including Gemini-1.5, Claude-3, and GPT-4o.

The paper further discusses implications for future defenses, including consistency checks, instruction hierarchy, and benchmarking attack performance alongside benign performance. The authors conclude that the vulnerabilities in multimodal agents are significant and that future research should focus on improving the robustness of these systems.
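As a rough illustration of the CLIP attack described above, the sketch below perturbs a trigger image under an L-infinity budget so that its image embedding moves toward the embedding of an adversarial caption, averaged over an ensemble of CLIP models. The function name, hyperparameters, and the assumption that each model exposes an `encode_image` method on [0, 1] inputs (with any normalization folded into the model) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_attack(image, text_embs, clip_models, eps=16 / 255, alpha=1 / 255, steps=200):
    """Illustrative ensemble CLIP attack (a sketch, not the paper's code).

    image:       [1, 3, H, W] tensor with pixel values in [0, 1].
    text_embs:   list of L2-normalized embeddings of the adversarial caption,
                 one per CLIP model (precomputed with each model's text encoder).
    clip_models: list of CLIP-style models exposing encode_image(); input
                 normalization is assumed to be handled inside the model.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = 0.0
        for model, txt_emb in zip(clip_models, text_embs):
            img_emb = F.normalize(model.encode_image(image + delta), dim=-1)
            # Maximize cosine similarity between the perturbed image and the
            # adversarial caption, summed over the model ensemble.
            loss = loss + (img_emb * txt_emb).sum()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                 # signed gradient ascent step
            delta.clamp_(-eps, eps)                            # stay within the L-inf budget
            delta.copy_((image + delta).clamp(0, 1) - image)   # keep pixel values valid
        delta.grad.zero_()
    return (image + delta).detach()
```

A transferable attack of this kind relies on the perturbation fooling several CLIP encoders at once, which is why the loss is summed over an ensemble rather than computed against a single model.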