RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS?

RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS?

4 Apr 2024 | Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu
This paper investigates the robustness of GPT-4 and GPT-4V against jailbreak attacks on both text and visual modalities. The authors construct a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. They conduct extensive red-teaming experiments on 11 different LLMs and MLLMs, including both proprietary and open-source models. The results show that GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source models. Among open-source models, Llama2 and Qwen-VL-Chat show better robustness, with Llama2 being more robust than GPT-4. Visual jailbreak methods have relatively limited transferability compared to textual methods. The study also finds that AutoDAN has better transferability than GCG. The results indicate that GPT-4 and GPT-4V are significantly more robust than open-source models, especially against visual jailbreak attacks. The study highlights the importance of safety alignment and fine-tuning in improving model robustness against jailbreak attacks. The authors also find that the current defense mechanisms of open-source models are not as effective as those of closed-source models. The study concludes that while GPT-4 and GPT-4V are more robust, they are not completely immune to jailbreak attacks. The results suggest that future research should focus on improving the robustness of open-source models and developing more effective defense mechanisms against jailbreak attacks.This paper investigates the robustness of GPT-4 and GPT-4V against jailbreak attacks on both text and visual modalities. The authors construct a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. They conduct extensive red-teaming experiments on 11 different LLMs and MLLMs, including both proprietary and open-source models. The results show that GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source models. Among open-source models, Llama2 and Qwen-VL-Chat show better robustness, with Llama2 being more robust than GPT-4. Visual jailbreak methods have relatively limited transferability compared to textual methods. The study also finds that AutoDAN has better transferability than GCG. The results indicate that GPT-4 and GPT-4V are significantly more robust than open-source models, especially against visual jailbreak attacks. The study highlights the importance of safety alignment and fine-tuning in improving model robustness against jailbreak attacks. The authors also find that the current defense mechanisms of open-source models are not as effective as those of closed-source models. The study concludes that while GPT-4 and GPT-4V are more robust, they are not completely immune to jailbreak attacks. The results suggest that future research should focus on improving the robustness of open-source models and developing more effective defense mechanisms against jailbreak attacks.
Reach us at info@study.space
[slides and audio] Red Teaming GPT-4V%3A Are GPT-4V Safe Against Uni%2FMulti-Modal Jailbreak Attacks%3F