AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

14 Mar 2024 | Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao
AdaShield is a defense mechanism that safeguards Multimodal Large Language Models (MLLMs) against structure-based jailbreak attacks. It uses adaptive shield prompting to improve robustness without fine-tuning the model or adding extra modules. The method first introduces a manually designed static defense prompt, AdaShield-S, which instructs the target MLLM to examine image and instruction content for harmful queries before responding; this static prompt, however, falls short in complex scenarios. To address this limitation, the authors propose AdaShield-A, an adaptive auto-refinement framework in which a target MLLM and a defender LLM iteratively optimize defense prompts, producing a diverse pool of prompts tailored to different attack scenarios. This improves the robustness of MLLMs against structure-based jailbreak attacks without compromising their general capabilities on benign tasks. Extensive experiments show that AdaShield-A outperforms existing defenses against structure-based attacks while maintaining performance on standard benign benchmarks. The method adds minimal inference-time cost and is applicable to black-box models. Overall, the results demonstrate that AdaShield effectively safeguards MLLMs against malicious queries while preserving their general capabilities. Its scope is limited, however: it is designed specifically for structure-based jailbreak attacks and does not address perturbation-based attacks.
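To make the auto-refinement loop of AdaShield-A concrete, below is a minimal Python sketch of the iterative prompt-optimization idea described above. It is an illustration only, not the authors' reference implementation: the `AttackSample` structure and the `query_target_mllm`, `refine_with_defender`, and `is_refusal` callables are hypothetical placeholders that a user would supply for their own target MLLM and defender LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AttackSample:
    image: bytes          # malicious image (e.g., typographic or screenshot-style content)
    instruction: str      # paired textual instruction
    scenario: str         # attack scenario label (e.g., "Illegal Activity")


def adaptive_refine(
    init_prompt: str,
    samples: List[AttackSample],
    query_target_mllm: Callable[[bytes, str], str],        # target MLLM: (image, text) -> response
    refine_with_defender: Callable[[str, str, str], str],  # defender LLM: (prompt, failure, scenario) -> new prompt
    is_refusal: Callable[[str], bool],                      # judge: does the response safely refuse?
    max_iters: int = 5,
) -> List[Tuple[str, str]]:
    """Iteratively refine defense prompts until the target MLLM refuses each jailbreak sample."""
    prompt_pool: List[Tuple[str, str]] = []
    for sample in samples:                                  # one refined prompt per attack scenario
        prompt = init_prompt
        for _ in range(max_iters):
            # Prepend the candidate defense prompt to the (image, instruction) query.
            response = query_target_mllm(sample.image, prompt + "\n" + sample.instruction)
            if is_refusal(response):                        # defense succeeded; stop refining
                break
            # Otherwise, ask the defender LLM to rewrite the prompt, using the failure as feedback.
            prompt = refine_with_defender(prompt, response, sample.scenario)
        prompt_pool.append((sample.scenario, prompt))
    return prompt_pool                                      # diverse, scenario-specific prompt pool
```

At inference time, the idea is to retrieve the best-matching prompt from this pool for an incoming query and prepend it before the target MLLM responds, which is what keeps the approach training-free and usable with black-box models.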