MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance


17 Jun 2024 | Renjie Pi*, Tianyang Han*, Jianshu Zhang*, Yueqi Xie*, Rui Pan*, Qing Lian*, Hanze Dong*, Jipeng Zhang*, Tong Zhang*
MLLM-Protector is a defense strategy designed to ensure the safety of Multimodal Large Language Models (MLLMs) without compromising their performance. The paper identifies a unique vulnerability in MLLMs: their susceptibility to malicious attacks through visual inputs. Unlike traditional text-based Large Language Models (LLMs), MLLMs accept an additional image modality, which acts as a "foreign language" that was not considered during safety alignment, making them more prone to generating harmful responses. The continuous nature of image signals poses significant alignment challenges, and the limited image-text pairs available for fine-tuning exacerbate the issue, leading to catastrophic forgetting of the model's original capabilities. To address these challenges, MLLM-Protector introduces a plug-and-play solution that decomposes the safety problem into two subtasks: identifying harmful responses with a lightweight harm detector and transforming harmful responses into harmless ones with a detoxifier. The harm detector evaluates the harmfulness of each generated response; if harmful content is detected, the detoxifier rewrites the response so that it complies with safety standards. This approach mitigates the risks posed by malicious visual inputs while preserving the original performance of the MLLM.
The paper also presents Safe-Harm-10K, a dataset used to train the harm detector and detoxifier, which will be released to the research community. Experimental results show that MLLM-Protector significantly reduces the attack success rate (ASR) for malicious image inputs, achieving near-complete prevention of harmful outputs in scenarios such as illegal activity and hate speech. The method is effective across multiple MLLMs, including LLaVA-7B and LLaVA-13B, and preserves the models' original performance without extensive retraining. On standard MLLM benchmarks, it exhibits minimal performance degradation compared to traditional safety fine-tuning. The paper concludes that MLLM-Protector provides a robust solution to a previously unaddressed aspect of MLLM security.
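At inference time, the described pipeline amounts to a thin wrapper around the MLLM's normal generation step. The sketch below illustrates that flow under stated assumptions: the class and method names (SafeMLLMPipeline, harm_detector.score, detoxifier.rewrite) and the 0.5 threshold are illustrative placeholders, not the paper's released API.

```python
# Minimal sketch of the plug-and-play safety pipeline described above.
# All names here are hypothetical placeholders, not the authors' actual code.

from dataclasses import dataclass


@dataclass
class SafeMLLMPipeline:
    mllm: object            # the underlying multimodal LLM (e.g., a LLaVA model)
    harm_detector: object   # lightweight classifier scoring response harmfulness
    detoxifier: object      # model that rewrites harmful responses into safe ones
    threshold: float = 0.5  # assumed cutoff for flagging a response as harmful

    def respond(self, image, prompt: str) -> str:
        # 1) The MLLM generates a response as usual.
        response = self.mllm.generate(image, prompt)

        # 2) The harm detector scores the generated response.
        harm_score = self.harm_detector.score(response)

        # 3) If the response is judged harmful, the detoxifier rewrites it;
        #    otherwise the original output is returned unchanged, which is
        #    why benign-task performance is preserved.
        if harm_score >= self.threshold:
            response = self.detoxifier.rewrite(prompt, response)
        return response
```

Because the harm detector and detoxifier are trained separately from the base MLLM, they can be attached to any model without modifying its weights, which is what makes the defense plug-and-play.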