Understanding White-box Multimodal Jailbreaks Against Large Vision-Language Models

This paper introduces a novel text-image multimodal attack strategy to exploit vulnerabilities in Large Vision-Language Models (VLMs). Unlike existing unimodal attacks that primarily focus on perturbing images, this work proposes a dual optimization objective to jointly attack both text and image modalities. The attack method involves two stages: first, an adversarial image prefix is optimized to generate harmful responses without text input, infusing toxic semantics into the image. Second, an adversarial text suffix is integrated and co-optimized with the image prefix to maximize the probability of the model generating affirmative responses to malicious queries. This combined approach, referred to as the Universal Master Key (UMK), can circumvent alignment defenses and generate objectionable content. Experimental results on MiniGPT-4 demonstrate a 96% success rate in jailbreaking the model, highlighting the urgent need for new alignment strategies to address these vulnerabilities. The paper also includes ablation studies and qualitative analysis to validate the effectiveness of the proposed method.This paper introduces a novel text-image multimodal attack strategy to exploit vulnerabilities in Large Vision-Language Models (VLMs). Unlike existing unimodal attacks that primarily focus on perturbing images, this work proposes a dual optimization objective to jointly attack both text and image modalities. The attack method involves two stages: first, an adversarial image prefix is optimized to generate harmful responses without text input, infusing toxic semantics into the image. Second, an adversarial text suffix is integrated and co-optimized with the image prefix to maximize the probability of the model generating affirmative responses to malicious queries. This combined approach, referred to as the Universal Master Key (UMK), can circumvent alignment defenses and generate objectionable content. Experimental results on MiniGPT-4 demonstrate a 96% success rate in jailbreaking the model, highlighting the urgent need for new alignment strategies to address these vulnerabilities. The paper also includes ablation studies and qualitative analysis to validate the effectiveness of the proposed method.

White-box Multimodal Jailbreaks Against Large Vision-Language Models

28 May 2024 | Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun ji, Guangnan Ye, Yu-Gang Jiang