[slides and audio] AdaShield%3A Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

**AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting** **Authors:** Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao **Institution:** Peking University, University of Wisconsin–Madison, International Digital Economy Academy, University of California, Davis **Abstract:** The integration of additional modalities in Multimodal Large Language Models (MLLMs) has exposed them to new vulnerabilities, particularly structure-based jailbreak attacks, where harmful content is injected into images to mislead MLLMs. This work introduces AdaShield, a novel defense mechanism that prepends inputs with defense prompts to protect MLLMs from such attacks without fine-tuning or training additional modules. AdaShield consists of a static defense prompt and an adaptive auto-refinement framework. The static defense prompt, manually designed, thoroughly examines image and instruction content and specifies response methods for malicious queries. The adaptive auto-refinement framework, comprising a target MLLM and a large language model-based defense prompt generator (Defender), collaboratively optimizes defense prompts through iterative communication. Extensive experiments on popular structure-based jailbreak attacks and benign datasets demonstrate that AdaShield consistently improves MLLMs' robustness against such attacks without compromising their general capabilities. **Keywords:** Multimodal Large Language Models, Safety, Defense Strategy, Prompt-based Learning **Contributions:** 1. Introduce AdaShield, a novel defense framework that automatically and adaptively prepends defense prompts to model inputs. 2. Develop an auto-refinement framework that employs a target MLLM and a defender to iteratively optimize defense prompts, enhancing robustness and prompt diversity. 3. Show superior performance in defending against structure-based jailbreak attacks while maintaining model performance on benign datasets. **Related Work:** - **Jailbreak Attacks on MLLMs:** Categorized into perturbation-based and structure-based attacks. - **Defense on MLLMs:** Inference-time and training-time alignments, with recent works focusing on post-hoc filtering defenses. **Methodology:** - **Preliminary:** Define defense tasks and task definition. - **AdaShield-S (Manual Static Defense Prompt):** Design effective defense prompts manually. - **AdaShield-A (Defense Prompt Auto-Refinement Framework):** Automatically optimize defense prompts using a defender LLM and a target MLLM. **Experiments:** - **Setup:** Use FigStep and QR attacks, MM-Vet benchmark, and various baselines. - **Results:** Show AdaShield outperforms baselines in defense effectiveness and generalizes well to unseen scenarios. **Conclusion & Limitation:** - AdaShield is effective in safeguarding MLLMs from structure-based jailbreak attacks without fine-tuning or additional modules. - Future work includes developing a universal defense framework for both structure-based and perturbation-based attacks.**AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting** **Authors:** Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao **Institution:** Peking University, University of Wisconsin–Madison, International Digital Economy Academy, University of California, Davis **Abstract:** The integration of additional modalities in Multimodal Large Language Models (MLLMs) has exposed them to new vulnerabilities, particularly structure-based jailbreak attacks, where harmful content is injected into images to mislead MLLMs. This work introduces AdaShield, a novel defense mechanism that prepends inputs with defense prompts to protect MLLMs from such attacks without fine-tuning or training additional modules. AdaShield consists of a static defense prompt and an adaptive auto-refinement framework. The static defense prompt, manually designed, thoroughly examines image and instruction content and specifies response methods for malicious queries. The adaptive auto-refinement framework, comprising a target MLLM and a large language model-based defense prompt generator (Defender), collaboratively optimizes defense prompts through iterative communication. Extensive experiments on popular structure-based jailbreak attacks and benign datasets demonstrate that AdaShield consistently improves MLLMs' robustness against such attacks without compromising their general capabilities. **Keywords:** Multimodal Large Language Models, Safety, Defense Strategy, Prompt-based Learning **Contributions:** 1. Introduce AdaShield, a novel defense framework that automatically and adaptively prepends defense prompts to model inputs. 2. Develop an auto-refinement framework that employs a target MLLM and a defender to iteratively optimize defense prompts, enhancing robustness and prompt diversity. 3. Show superior performance in defending against structure-based jailbreak attacks while maintaining model performance on benign datasets. **Related Work:** - **Jailbreak Attacks on MLLMs:** Categorized into perturbation-based and structure-based attacks. - **Defense on MLLMs:** Inference-time and training-time alignments, with recent works focusing on post-hoc filtering defenses. **Methodology:** - **Preliminary:** Define defense tasks and task definition. - **AdaShield-S (Manual Static Defense Prompt):** Design effective defense prompts manually. - **AdaShield-A (Defense Prompt Auto-Refinement Framework):** Automatically optimize defense prompts using a defender LLM and a target MLLM. **Experiments:** - **Setup:** Use FigStep and QR attacks, MM-Vet benchmark, and various baselines. - **Results:** Show AdaShield outperforms baselines in defense effectiveness and generalizes well to unseen scenarios. **Conclusion & Limitation:** - AdaShield is effective in safeguarding MLLMs from structure-based jailbreak attacks without fine-tuning or additional modules. - Future work includes developing a universal defense framework for both structure-based and perturbation-based attacks.

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

14 Mar 2024 | Yu Wang* 1,2 Xiaogeng Liu* 2 Yu Li 3 Muhao Chen 4 Chaowei Xiao 2