Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts

2024 | Yi Liu, Chengjun Cai, Xiaoli Zhang, Xingliang Yuan, Cong Wang
This paper introduces Arondight, a red teaming framework for Large Vision Language Models (VLMs) that addresses the lack of visual modality and diversity in existing red teaming methods. Arondight mounts an automated multi-modal jailbreak attack in which visual jailbreak prompts are generated by a red team VLM and textual prompts by a red team LLM guided by a reinforcement learning (RL) agent. To broaden test coverage, the RL agent's objective integrates entropy bonuses and novelty reward metrics (sketched below), optimizing it to generate diverse, previously unseen test cases and making the security evaluation of VLMs more comprehensive.

An evaluation of ten cutting-edge VLMs reveals significant security vulnerabilities, particularly in generating toxic images and in aligning multi-modal prompts. Arondight achieves an average attack success rate of 84.5% on GPT-4 across all fourteen prohibited scenarios defined by OpenAI. The framework also categorizes existing VLMs by safety level and provides corresponding reinforcement recommendations. The multi-modal prompt dataset and red team code will be released after ethics committee approval.

Arondight's key contributions are: (1) a red team framework for comprehensively testing the safety performance of VLMs; (2) an auto-generated multi-modal jailbreak attack strategy that covers both the image and text modalities and generates diverse test cases; and (3) extensive experiments on ten VLMs, classifying them by safety and achieving an 84.5% success rate against GPT-4. The results expose potential weaknesses in existing VLM alignment mechanisms and highlight the need for improved safety and alignment in VLMs.
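To make the diversity mechanism concrete, below is a minimal Python sketch of a composite RL reward combining an attack-success signal with an entropy bonus and a novelty metric. The function names, the character n-gram novelty measure, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import math

def entropy_bonus(token_probs):
    """Shannon entropy of the policy's output distribution.

    A higher bonus pushes the red team LLM to keep exploring varied
    phrasings instead of collapsing onto a single successful prompt.
    """
    return -sum(p * math.log(p) for p in token_probs if p > 0)

def novelty_reward(candidate, history, n=3):
    """Fraction of the candidate's character n-grams never seen in
    previously generated prompts (hypothetical novelty metric)."""
    grams = {candidate[i:i + n] for i in range(len(candidate) - n + 1)}
    if not grams:
        return 0.0
    seen = set()
    for prompt in history:
        seen.update(prompt[i:i + n] for i in range(len(prompt) - n + 1))
    return len(grams - seen) / len(grams)

def total_reward(attack_success, token_probs, candidate, history,
                 alpha=0.1, beta=0.5):
    """Composite reward: jailbreak success plus diversity terms.

    attack_success is 1.0 if the target VLM was jailbroken, else 0.0;
    alpha and beta are illustrative weights, not values from the paper.
    """
    return (attack_success
            + alpha * entropy_bonus(token_probs)
            + beta * novelty_reward(candidate, history))
```

Under an objective of this shape, the policy earns the highest return for prompts that both succeed against the target VLM and differ from earlier test cases, which is how Arondight's RL agent is driven toward diverse, previously unseen jailbreaks.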