Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

1 Jul 2024 | Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao
The paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), a jailbreak method for large vision language models (LVLMs). Traditional jailbreaks focus primarily on perturbing the visual input, which is largely ineffective against aligned models that fuse visual and textual features. BAP addresses this limitation by jointly optimizing visual and textual prompts: it first adversarially embeds a universal perturbation in the image, guided by a query-agnostic corpus, and then optimizes the textual prompt to elicit the specific harmful content. Extensive evaluations across multiple datasets and LVLMs show a significant improvement over existing methods (a 29.03% average increase in attack success rate). The paper also demonstrates BAP against black-box commercial LVLMs and discusses its applications in bias evaluation and adversarial robustness testing. The main contributions are the introduction of BAP, its detailed implementation, and its superior performance in both white-box and black-box settings.
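
For intuition, the two-stage optimization might look like the following minimal sketch. It assumes a white-box PyTorch LVLM exposing a hypothetical `model.loss(image, prompt, target)` language-modeling loss and a `model.generate` method, plus a `judge_llm` helper for the textual stage; the names, hyperparameters, and PGD-style update are illustrative assumptions, not the authors' released implementation.

```python
import torch

def optimize_visual_prompt(model, image, corpus, target="Sure, here is how",
                           epsilon=32 / 255, alpha=1 / 255, steps=1000):
    """PGD-style universal perturbation: raise the likelihood of an
    affirmative target response across a query-agnostic prompt corpus."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Sample a query-agnostic prompt so the perturbation transfers across queries.
        prompt = corpus[torch.randint(len(corpus), (1,)).item()]
        loss = model.loss(image + delta, prompt, target)  # LM loss of the target text
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                # descend: make target more likely
            delta.clamp_(-epsilon, epsilon)                   # L-infinity budget
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep pixels in [0, 1]
            delta.grad.zero_()
    return (image + delta).detach()

def optimize_textual_prompt(model, judge_llm, adv_image, query, rounds=5):
    """Iteratively rewrite the textual prompt with an auxiliary judge LLM
    until the target model produces the intended content."""
    prompt = query
    for _ in range(rounds):
        response = model.generate(adv_image, prompt)
        verdict, advice = judge_llm.assess(query, response)  # hypothetical judge API
        if verdict == "success":
            break
        prompt = judge_llm.rewrite(prompt, advice)           # feedback-guided rewriting
    return prompt
```

The visual stage is query-agnostic by design, so one perturbed image can be reused across many harmful queries, while the textual stage specializes the attack to each query.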