Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character

12 Jun 2024 | Siyuan Ma, Weidi Luo, Yu Wang, Xiaogeng Liu
The paper introduces Visual Role-play (VRP), a novel jailbreak attack method for Multimodal Large Language Models (MLLMs). VRP uses Large Language Models (LLMs) to generate detailed descriptions of high-risk characters and to create corresponding images. When these images are paired with benign role-play instruction text, they mislead MLLMs into generating malicious responses by enacting characters with negative attributes. The method is designed to improve the effectiveness and generalizability of jailbreak attacks, addressing the limitations of existing structure-based methods.
Extensive experiments on popular benchmarks show that VRP outperforms strong baselines, such as Query-Relevant and FigStep, by an average Attack Success Rate (ASR) margin of 14.3%. The paper also discusses a universal setup of VRP and demonstrates that it can handle a wide range of malicious queries. The main contributions are the introduction of VRP, its effectiveness in jailbreak attacks, and its strong generalization capabilities. The paper concludes with a discussion of limitations and future work, emphasizing the need for more sophisticated character generation strategies and iterative improvement of character image quality.
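For readers unfamiliar with the metric, the sketch below shows one plausible way an Attack Success Rate and an average margin between two methods could be computed. It is not the paper's evaluation code: the function names, the per-model averaging, and all numbers are assumptions made purely for illustration, and the toy values are not results from the paper.

```python
from statistics import mean


def attack_success_rate(judgements):
    """Return the percentage of queries judged as successful attacks.

    `judgements` is a hypothetical list of booleans, one per query,
    produced by whatever judge the evaluation uses.
    """
    if not judgements:
        raise ValueError("need at least one judged query")
    return 100.0 * sum(1 for j in judgements if j) / len(judgements)


def average_margin(method_asr, baseline_asr):
    """Mean per-model ASR difference, in percentage points.

    Both arguments map a model name to an ASR percentage. Averaging
    over the models the two dicts share is an assumption, not
    necessarily how the paper aggregates its numbers.
    """
    shared = sorted(set(method_asr) & set(baseline_asr))
    if not shared:
        raise ValueError("no models in common")
    return mean(method_asr[m] - baseline_asr[m] for m in shared)


# Toy inputs, invented for illustration only.
toy_judgements = [True, False, True, True]
toy_method = {"model_a": 60.0, "model_b": 40.0}
toy_baseline = {"model_a": 50.0, "model_b": 20.0}

print(attack_success_rate(toy_judgements))
print(average_margin(toy_method, toy_baseline))
```

Under this reading, a margin is a difference in percentage points between two ASR values, so a reported 14.3% margin would be an absolute gap rather than a relative improvement.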