This paper addresses the vulnerability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to jailbreak attacks, which can bypass safety measures and elicit harmful content. The authors construct a comprehensive evaluation benchmark of 1445 harmful questions covering 11 different safety policies and conduct extensive red-teaming experiments on 11 LLMs and MLLMs, including both proprietary and open-source models. The results show that GPT-4 and GPT-4V are more robust against both textual and visual jailbreak methods than open-source models; among the open-source models, Llama2 and Qwen-VL-Chat are comparatively more robust. The study also finds that visual jailbreak methods exhibit limited transferability compared to textual ones. The paper provides a detailed analysis of the robustness of different models and the effectiveness of various attack methods, contributing to a better understanding of the security of LLMs and MLLMs.