This paper presents a rigorous evaluation of the safety of GPT-4o, a recently released multimodal large language model (MLLM), against jailbreak attacks. Motivated by the model's potential societal impact, the study applies a series of multimodal and unimodal jailbreak attacks drawn from four commonly used benchmarks, spanning the text, speech, and image modalities. More than 4,000 initial text queries and nearly 8,000 responses are analyzed and statistically evaluated.
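The core of such an evaluation is a query-response-judge loop. Below is a minimal sketch of the text-modality case, assuming the OpenAI Python SDK; the refusal-substring check is a crude stand-in for the paper's actual judging protocol, and the query list is supplied by the caller from whichever benchmark is under test.

```python
# Minimal sketch of a text-modality jailbreak evaluation loop.
# Assumptions (not from the paper): the OpenAI Python SDK and a
# simplistic refusal-substring judge in place of a real safety evaluator.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't assist")

def is_refusal(text: str) -> bool:
    """Crude proxy for a safety judge: treat common refusal phrases as 'safe'."""
    return any(marker in text for marker in REFUSAL_MARKERS)

def attack_success_rate(queries: list[str], model: str = "gpt-4o") -> float:
    """Return the fraction of jailbreak queries that were not refused."""
    successes = 0
    for query in queries:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
        )
        if not is_refusal(response.choices[0].message.content or ""):
            successes += 1
    return successes / len(queries)
```

A full study would swap the substring judge for a human or LLM-based evaluator and repeat the loop per modality and per benchmark.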
Key findings include:
1. **Enhanced Safety in Text Modality**: Compared with earlier models such as GPT-4V, GPT-4o shows improved safety against text-based jailbreak attacks.
2. **New Attack Vectors in Audio Modality**: The newly introduced audio modality opens new attack vectors for jailbreaking GPT-4o.
3. **Ineffectiveness of Existing Methods**: Existing black-box multimodal jailbreak methods are largely ineffective against both GPT-4o and GPT-4V, although GPT-4o exhibits lower safety than GPT-4V at the multimodal level.
4. **Transferability of Text Adversarial Prompts**: Text adversarial prompts optimized against other LLMs can still successfully jailbreak GPT-4o, indicating the need for robust alignment guardrails (a minimal probe for this is sketched after this list).
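The transferability finding can be probed directly: append an adversarial suffix optimized against an open-source LLM (e.g., via a GCG-style attack) to a harmful query and send it to GPT-4o. In the sketch below, the suffix string is a placeholder, not a real optimized suffix, and the OpenAI SDK usage mirrors the loop above.

```python
# Sketch of a transferability probe: append an adversarial suffix crafted
# against another LLM to a harmful query and test it on GPT-4o.
# ADV_SUFFIX is a placeholder; a real study would use a suffix actually
# produced by an optimization-based attack such as GCG.
from openai import OpenAI

client = OpenAI()

ADV_SUFFIX = "<suffix optimized against an open-source LLM>"

def probe_transfer(harmful_query: str, model: str = "gpt-4o") -> str:
    """Send the query with the transferred suffix; return the raw response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{harmful_query} {ADV_SUFFIX}"}],
    )
    return response.choices[0].message.content or ""
```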
The study highlights the importance of addressing safety concerns in large models and offers critical insights into GPT-4o's safety profile. While GPT-4o is better protected against text-based attacks, it remains vulnerable to audio and multimodal attacks, underscoring the need for continued research on alignment strategies and mitigation techniques.