Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts

2024 | Yi Liu, Chengjun Cai, Xiaoli Zhang, Xingliang Yuan, Cong Wang
This paper introduces Arondight, a red teaming framework for Large Vision Language Models (VLMs) that addresses the lack of visual modality and diversity in existing red teaming methods. Arondight mounts an automated multi-modal jailbreak attack in which visual jailbreak prompts are generated by a red team VLM and textual prompts by a red team LLM guided by a reinforcement learning (RL) agent. To broaden test coverage, the RL agent's objective integrates entropy bonuses and novelty reward metrics (sketched below), optimizing it to generate diverse, previously unseen test cases and making the security evaluation of VLMs more comprehensive.

An evaluation of ten cutting-edge VLMs reveals significant security vulnerabilities, particularly in generating toxic images and in aligning multi-modal prompts. Arondight achieves an average attack success rate of 84.5% on GPT-4 across all fourteen prohibited scenarios defined by OpenAI. The framework also categorizes existing VLMs by safety level and provides corresponding reinforcement recommendations. The multi-modal prompt dataset and red team code will be released after ethics committee approval.

Arondight's key contributions are: (1) a red team framework for comprehensively testing the safety performance of VLMs; (2) an auto-generated multi-modal jailbreak attack strategy that covers both the image and text modalities and generates diverse test cases; and (3) extensive experiments on ten VLMs, classifying them by safety and achieving an 84.5% success rate against GPT-4. The results expose potential weaknesses in existing VLM alignment mechanisms and highlight the need for improved safety and alignment in VLMs.
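To make the diversity mechanism concrete, below is a minimal Python sketch of a composite RL reward combining an attack-success signal with an entropy bonus and a novelty metric. The function names, the character n-gram novelty measure, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import math

def entropy_bonus(token_probs):
    """Shannon entropy of the policy's output distribution.

    A higher bonus pushes the red team LLM to keep exploring varied
    phrasings instead of collapsing onto a single successful prompt.
    """
    return -sum(p * math.log(p) for p in token_probs if p > 0)

def novelty_reward(candidate, history, n=3):
    """Fraction of the candidate's character n-grams never seen in
    previously generated prompts (hypothetical novelty metric)."""
    grams = {candidate[i:i + n] for i in range(len(candidate) - n + 1)}
    if not grams:
        return 0.0
    seen = set()
    for prompt in history:
        seen.update(prompt[i:i + n] for i in range(len(prompt) - n + 1))
    return len(grams - seen) / len(grams)

def total_reward(attack_success, token_probs, candidate, history,
                 alpha=0.1, beta=0.5):
    """Composite reward: jailbreak success plus diversity terms.

    attack_success is 1.0 if the target VLM was jailbroken, else 0.0;
    alpha and beta are illustrative weights, not values from the paper.
    """
    return (attack_success
            + alpha * entropy_bonus(token_probs)
            + beta * novelty_reward(candidate, history))
```

Under an objective of this shape, the policy earns the highest return for prompts that both succeed against the target VLM and differ from earlier test cases, which is how Arondight's RL agent is driven toward diverse, previously unseen jailbreaks.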