This paper introduces Visual Role-play (VRP), a structure-based jailbreak attack on Multimodal Large Language Models (MLLMs). VRP uses role-play: it pairs an image of a high-risk character with a malicious query to mislead the MLLM into generating a harmful response. A Large Language Model (LLM) first generates a detailed description of a high-risk character, and that description is then used to create a corresponding character image. The image is combined with a benign role-play instruction text to form the complete jailbreak input. The paper also extends VRP to a universal setting, where it generalizes well across models and scenarios.

In extensive experiments on benchmark datasets, VRP outperforms existing methods such as Query Relevant and FigStep by an average Attack Success Rate (ASR) margin of 14.3%. It remains effective against both system prompt-based defense and the Eyes Closed, Safety On (ECSO) approach. VRP can also be integrated with the baseline techniques, which improves their effectiveness. The paper highlights that both the character image and the typography of the character description contribute to jailbreak performance.

VRP may be less effective against poorly performing MLLMs, whose limited instruction-following and image-understanding capabilities prevent them from acting on the role-play input. Future work includes exploring more sophisticated strategies for generating characters and improving the quality of character images generated by LLMs and diffusion models.
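As a reading aid for the headline number, the sketch below shows one plausible way an ASR and an average ASR margin could be computed. It is not taken from the paper: the function names, the per-model averaging, and the model keys are assumptions made for illustration, and the summary does not say how the 14.3% margin was aggregated.

```python
from statistics import mean


def attack_success_rate(judgements):
    """Return the ASR in percent: the share of queries judged as successful attacks.

    `judgements` is a sequence of booleans, one per query (hypothetical format).
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(judgements) / len(judgements)


def average_margin(asr_method, asr_baseline):
    """Return the mean per-model ASR difference, in percentage points.

    Both arguments map a model name to an ASR in percent. Only models present
    in both mappings are compared. Averaging per model is an assumption here.
    """
    shared = asr_method.keys() & asr_baseline.keys()
    if not shared:
        raise ValueError("no models in common")
    return mean(asr_method[m] - asr_baseline[m] for m in shared)


if __name__ == "__main__":
    # Placeholder values only; these are NOT results from the paper.
    method = {"model_a": 60.0, "model_b": 40.0}
    baseline = {"model_a": 50.0, "model_b": 20.0}
    print(attack_success_rate([True, False, True, True]))
    print(average_margin(method, baseline))
```

Under this reading, a margin is a difference in percentage points between two ASR values, averaged over the evaluated models.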

Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character

12 Jun 2024 | Siyuan Ma, Weidi Luo, Yu Wang, Xiaogeng Liu