28 May 2024 | Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun ji, Guangnan Ye, Yu-Gang Jiang
This paper introduces a text-image multimodal attack strategy against Large Vision-Language Models (VLMs), aiming to exploit their vulnerabilities by jointly attacking both text and image modalities. The proposed method, called the Universal Master Key (UMK), consists of an adversarial image prefix and an adversarial text suffix. The adversarial image prefix is optimized to generate harmful content without text input, while the adversarial text suffix is co-optimized with the image prefix to maximize the probability of eliciting affirmative responses to harmful instructions. The UMK can be integrated into malicious queries to bypass the alignment defenses of VLMs and generate objectionable content, known as jailbreaks.
The method is evaluated on benchmark datasets, demonstrating that the UMK can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the vulnerability of VLMs and the urgent need for new alignment strategies. The proposed strategy addresses the limitations of previous unimodal attacks by employing dual optimization objectives, which guide the model to generate affirmative responses with high toxicity. This approach not only increases the toxicity of generated responses but also improves the model's adherence to instructions.
The experiments show that the proposed method outperforms existing unimodal attacks in terms of attack success rate and toxicity. The results indicate that the dual optimization objective strategy is effective in enhancing the toxicity of adversarial examples while maintaining the model's ability to follow instructions. The method also demonstrates superior performance in various categories of harmful instructions, including identity attack, disinformation, and x-risk. The results highlight the effectiveness of the text-image multimodal attack strategy in exploiting the vulnerabilities of VLMs.This paper introduces a text-image multimodal attack strategy against Large Vision-Language Models (VLMs), aiming to exploit their vulnerabilities by jointly attacking both text and image modalities. The proposed method, called the Universal Master Key (UMK), consists of an adversarial image prefix and an adversarial text suffix. The adversarial image prefix is optimized to generate harmful content without text input, while the adversarial text suffix is co-optimized with the image prefix to maximize the probability of eliciting affirmative responses to harmful instructions. The UMK can be integrated into malicious queries to bypass the alignment defenses of VLMs and generate objectionable content, known as jailbreaks.
The method is evaluated on benchmark datasets, demonstrating that the UMK can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the vulnerability of VLMs and the urgent need for new alignment strategies. The proposed strategy addresses the limitations of previous unimodal attacks by employing dual optimization objectives, which guide the model to generate affirmative responses with high toxicity. This approach not only increases the toxicity of generated responses but also improves the model's adherence to instructions.
The experiments show that the proposed method outperforms existing unimodal attacks in terms of attack success rate and toxicity. The results indicate that the dual optimization objective strategy is effective in enhancing the toxicity of adversarial examples while maintaining the model's ability to follow instructions. The method also demonstrates superior performance in various categories of harmful instructions, including identity attack, disinformation, and x-risk. The results highlight the effectiveness of the text-image multimodal attack strategy in exploiting the vulnerabilities of VLMs.