18 Jun 2024 | Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
This paper examines the security risks of multimodal agents, autonomous systems that perceive and act in real environments. The authors demonstrate that such agents are vulnerable to adversarial attacks even when the attacker has only limited access to and knowledge of the environment. They introduce two attacks: the *captioner attack* and the *CLIP attack*. The captioner attack targets white-box captioners that convert images into captions for the agent's language model, while the CLIP attack targets a set of CLIP models that may be used inside proprietary VLMs. To evaluate the attacks, the authors curate VisualWebArena-Adv, a set of adversarial tasks built on VisualWebArena, an environment for web-based multimodal agent tasks. The captioner attack achieves a 75% success rate in making a captioner-augmented GPT-4V agent execute adversarial goals, while the CLIP attack achieves success rates of 21% and 43% when the captioner is removed or when GPT-4V generates its own captions, respectively. The paper closes with implications for defenses, highlighting consistency checks and instruction hierarchies as ways to mitigate these attacks.
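At a high level, the CLIP attack perturbs an image within a small pixel budget so that a CLIP-style image encoder maps it close to an attacker-chosen text description. The sketch below illustrates that general idea with a single open_clip model and an L_inf-bounded PGD loop; it is a simplified illustration rather than the authors' implementation (which optimizes against an ensemble of CLIP models), and the model choice, hyperparameters, and helper name `clip_attack` are assumptions.

```python
# Minimal sketch of a CLIP-targeted adversarial perturbation (L_inf PGD).
# Illustrative only: the paper's CLIP attack uses an ensemble of CLIP models;
# the model name, budget, and step size here are assumed for demonstration.
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()
for p in model.parameters():          # the attack optimizes the input, not the model
    p.requires_grad_(False)

# Standard CLIP normalization constants.
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_attack(image, target_text, eps=8 / 255, step=1 / 255, iters=200):
    """Perturb `image` (a [1,3,224,224] tensor in [0,1] on `device`) so its CLIP
    embedding moves toward `target_text`, within an L_inf budget of `eps`."""
    with torch.no_grad():
        target = F.normalize(model.encode_text(tokenizer([target_text]).to(device)), dim=-1)

    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        emb = F.normalize(model.encode_image((image + delta - MEAN) / STD), dim=-1)
        loss = -(emb * target).sum()              # negative cosine similarity
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()     # signed gradient step
            delta.clamp_(-eps, eps)               # project onto the L_inf ball
            delta.data = (image + delta).clamp(0, 1) - image  # keep pixels valid
        delta.grad.zero_()
    return (image + delta).detach()
```

The perturbed image looks essentially unchanged to a person but pulls the image embedding toward the attacker's target text; optimizing the same objective over several CLIP models at once, as the paper does, is what gives the perturbation a chance of transferring to proprietary VLMs that share similar vision encoders.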