14 Mar 2024 | Yu Wang* 1,2 Xiaogeng Liu* 2 Yu Li 3 Muhao Chen 4 Chaowei Xiao 2
**AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting**
**Authors:** Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao
**Institution:** Peking University, University of Wisconsin–Madison, International Digital Economy Academy, University of California, Davis
**Abstract:**
The integration of additional modalities in Multimodal Large Language Models (MLLMs) has exposed them to new vulnerabilities, particularly structure-based jailbreak attacks, where harmful content is injected into images to mislead MLLMs. This work introduces AdaShield, a novel defense mechanism that prepends inputs with defense prompts to protect MLLMs from such attacks without fine-tuning or training additional modules. AdaShield consists of a static defense prompt and an adaptive auto-refinement framework. The static defense prompt, manually designed, thoroughly examines image and instruction content and specifies response methods for malicious queries. The adaptive auto-refinement framework, comprising a target MLLM and a large language model-based defense prompt generator (Defender), collaboratively optimizes defense prompts through iterative communication. Extensive experiments on popular structure-based jailbreak attacks and benign datasets demonstrate that AdaShield consistently improves MLLMs' robustness against such attacks without compromising their general capabilities.
**Keywords:**
Multimodal Large Language Models, Safety, Defense Strategy, Prompt-based Learning
**Contributions:**
1. Introduce AdaShield, a novel defense framework that automatically and adaptively prepends defense prompts to model inputs.
2. Develop an auto-refinement framework that employs a target MLLM and a defender to iteratively optimize defense prompts, enhancing robustness and prompt diversity.
3. Show superior performance in defending against structure-based jailbreak attacks while maintaining model performance on benign datasets.
**Related Work:**
- **Jailbreak Attacks on MLLMs:** Categorized into perturbation-based and structure-based attacks.
- **Defense on MLLMs:** Inference-time and training-time alignments, with recent works focusing on post-hoc filtering defenses.
**Methodology:**
- **Preliminary:** Define defense tasks and task definition.
- **AdaShield-S (Manual Static Defense Prompt):** Design effective defense prompts manually.
- **AdaShield-A (Defense Prompt Auto-Refinement Framework):** Automatically optimize defense prompts using a defender LLM and a target MLLM.
**Experiments:**
- **Setup:** Use FigStep and QR attacks, MM-Vet benchmark, and various baselines.
- **Results:** Show AdaShield outperforms baselines in defense effectiveness and generalizes well to unseen scenarios.
**Conclusion & Limitation:**
- AdaShield is effective in safeguarding MLLMs from structure-based jailbreak attacks without fine-tuning or additional modules.
- Future work includes developing a universal defense framework for both structure-based and perturbation-based attacks.**AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting**
**Authors:** Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao
**Institution:** Peking University, University of Wisconsin–Madison, International Digital Economy Academy, University of California, Davis
**Abstract:**
The integration of additional modalities in Multimodal Large Language Models (MLLMs) has exposed them to new vulnerabilities, particularly structure-based jailbreak attacks, where harmful content is injected into images to mislead MLLMs. This work introduces AdaShield, a novel defense mechanism that prepends inputs with defense prompts to protect MLLMs from such attacks without fine-tuning or training additional modules. AdaShield consists of a static defense prompt and an adaptive auto-refinement framework. The static defense prompt, manually designed, thoroughly examines image and instruction content and specifies response methods for malicious queries. The adaptive auto-refinement framework, comprising a target MLLM and a large language model-based defense prompt generator (Defender), collaboratively optimizes defense prompts through iterative communication. Extensive experiments on popular structure-based jailbreak attacks and benign datasets demonstrate that AdaShield consistently improves MLLMs' robustness against such attacks without compromising their general capabilities.
**Keywords:**
Multimodal Large Language Models, Safety, Defense Strategy, Prompt-based Learning
**Contributions:**
1. Introduce AdaShield, a novel defense framework that automatically and adaptively prepends defense prompts to model inputs.
2. Develop an auto-refinement framework that employs a target MLLM and a defender to iteratively optimize defense prompts, enhancing robustness and prompt diversity.
3. Show superior performance in defending against structure-based jailbreak attacks while maintaining model performance on benign datasets.
**Related Work:**
- **Jailbreak Attacks on MLLMs:** Categorized into perturbation-based and structure-based attacks.
- **Defense on MLLMs:** Inference-time and training-time alignments, with recent works focusing on post-hoc filtering defenses.
**Methodology:**
- **Preliminary:** Define defense tasks and task definition.
- **AdaShield-S (Manual Static Defense Prompt):** Design effective defense prompts manually.
- **AdaShield-A (Defense Prompt Auto-Refinement Framework):** Automatically optimize defense prompts using a defender LLM and a target MLLM.
**Experiments:**
- **Setup:** Use FigStep and QR attacks, MM-Vet benchmark, and various baselines.
- **Results:** Show AdaShield outperforms baselines in defense effectiveness and generalizes well to unseen scenarios.
**Conclusion & Limitation:**
- AdaShield is effective in safeguarding MLLMs from structure-based jailbreak attacks without fine-tuning or additional modules.
- Future work includes developing a universal defense framework for both structure-based and perturbation-based attacks.