The paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance" addresses the unique vulnerability of multimodal large language models (MLLMs) to malicious attacks through visual inputs. Unlike text-based large language models (LLMs), MLLMs are more susceptible to producing harmful responses because the continuous nature of image signals poses significant alignment challenges. The authors introduce MLLM-Protector, a plug-and-play strategy built on two subtasks: identifying harmful responses with a lightweight harm detector and transforming harmful responses into harmless ones with a detoxifier. This approach mitigates the risks posed by malicious visual inputs without compromising the original performance of the MLLM. The authors also curate a dataset called Safe-Harm-10K for training the harm detector and detoxifier, and show that MLLM-Protector significantly reduces the attack success rate (ASR) across scenarios including illegal activity and hate speech. The method is evaluated on the MM-SafetyBench and FigStep benchmarks, demonstrating robust defense performance. The authors conclude by discussing the limitations and ethical implications of their work and emphasizing the need for further research in this area.
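The summary above implies a simple inference-time pipeline: the MLLM produces a response, the lightweight harm detector scores it, and the detoxifier rewrites only the responses flagged as harmful. The sketch below illustrates that control flow under stated assumptions; the function names (`mllm_generate`, `harm_score`, `detoxify`) and the `THRESHOLD` value are hypothetical placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of the plug-and-play control flow described above.
# All names (mllm_generate, harm_score, detoxify, THRESHOLD) are
# hypothetical stand-ins, not the paper's actual implementation.

from typing import Callable

THRESHOLD = 0.5  # assumed cutoff above which a response is treated as harmful


def protect(
    mllm_generate: Callable[[str, bytes], str],   # (prompt, image) -> response
    harm_score: Callable[[str, str], float],      # (prompt, response) -> harm probability
    detoxify: Callable[[str, str], str],          # (prompt, harmful response) -> safe response
    prompt: str,
    image: bytes,
) -> str:
    """Generate a response, then detect and, if needed, detoxify it."""
    response = mllm_generate(prompt, image)
    if harm_score(prompt, response) > THRESHOLD:
        # Only flagged responses are rewritten; benign outputs pass
        # through unchanged, which preserves the base model's utility.
        response = detoxify(prompt, response)
    return response
```

Because the detector and detoxifier wrap the base MLLM rather than modifying its weights, benign responses are returned untouched, which is how the approach avoids degrading the model's original performance.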