Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

1 Jul 2024 | Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), a novel method for jailbreaking large vision language models (LVLMs) by simultaneously optimizing both visual and textual prompts. Existing jailbreak attacks primarily focus on perturbing the visual modality, but they are ineffective against LVLMs that integrate visual and textual features for generation. BAP addresses this limitation by adversarially embedding universal perturbations in images and optimizing textual prompts to induce harmful responses. The method uses a few-shot query-agnostic corpus to generate adversarial images and employs chain-of-thought reasoning to refine textual prompts.

The BAP framework is evaluated on various datasets and LVLMs, demonstrating significant improvements in attack success rate (+29.03% on average) compared to existing methods. The approach is also effective against commercial LVLMs such as Gemini and ChatGLM. The paper highlights the potential of BAP to generate biased outputs and evaluate model robustness. The method is designed to be universal, capable of attacking LVLMs without requiring specific scenario samples. The results show that BAP achieves high and stable performance in both white-box and black-box settings, demonstrating its effectiveness in bypassing model guardrails and generating harmful content. The paper also discusses the implications of BAP for evaluating bias and adversarial robustness in LVLMs.
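To make the visual side of the attack concrete, the sketch below shows a PGD-style loop that learns a single universal image perturbation over a small, query-agnostic corpus of target prompts, which is the general pattern the summary describes. The `lvlm_loss` function and the corpus entries are hypothetical placeholders (a real attack would compute the loss of an affirmative response from the target LVLM's logits); this is an illustrative sketch, not the authors' implementation.

```python
import torch

def lvlm_loss(image: torch.Tensor, target_text: str) -> torch.Tensor:
    """Hypothetical stand-in: should return the target LVLM's loss for
    producing an affirmative (non-refusing) response `target_text`
    conditioned on `image`. Here a dummy differentiable value is used
    so the sketch runs end to end."""
    return (image ** 2).mean()

# Few-shot, query-agnostic affirmative prefixes (illustrative only).
corpus = [
    "Sure, here is how to ...",
    "Absolutely, the steps are ...",
]

image = torch.rand(1, 3, 224, 224)                    # clean input image
delta = torch.zeros_like(image, requires_grad=True)   # universal perturbation
epsilon, alpha, steps = 8 / 255, 1 / 255, 100         # L-inf budget, step size, iterations

for _ in range(steps):
    # Sum the loss over the whole corpus so the perturbation is query-agnostic.
    loss = sum(lvlm_loss(torch.clamp(image + delta, 0, 1), t) for t in corpus)
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()   # descend toward affirmative responses
        delta.clamp_(-epsilon, epsilon)      # keep the perturbation imperceptible
        delta.grad.zero_()

adv_image = torch.clamp(image + delta, 0, 1)  # adversarial image reused across queries
```

The textual side of BAP would then iteratively refine the accompanying prompt (using chain-of-thought-style feedback from the model's refusals) rather than gradients, but that loop is omitted here.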