Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation


2024-07-15 | Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, and Yu Zhang
Multimodal large language models (MLLMs) have shown impressive reasoning abilities, but they are more vulnerable to jailbreak attacks than their LLM predecessors: the safety mechanisms of a pre-aligned LLM can be bypassed when malicious content is carried in image features. To enhance model safety, the authors propose ECSO (Eyes Closed, Safety On), a training-free approach that leverages the inherent safety awareness of MLLMs. ECSO transforms unsafe images into text, which reactivates the safety mechanisms of the pre-aligned LLM. Experiments on five state-of-the-art MLLMs show substantial safety improvements (e.g., 37.6% on MM-SafetyBench and 71.3% on VLSafe with LLaVA-1.5-7B) while maintaining utility.

At a high level, ECSO first asks the MLLM to assess the safety of its own response. If the response is judged unsafe, the image input is converted into text via a query-aware image-to-text transformation, reducing the MLLM to a text-only LLM; the response is then regenerated without the image, which restores the safety mechanism of the pre-aligned LLM.

The main contributions are: (1) demonstrating that MLLMs can detect unsafe content in their own responses and inherit safety mechanisms from their pre-aligned LLMs; (2) proposing ECSO, a training-free and self-contained MLLM protection strategy; and (3) showing that ECSO significantly enhances the safety of five state-of-the-art MLLMs without sacrificing their utility.

The paper also reviews related work on MLLM vulnerability and protection, highlighting the challenges of defending MLLMs against malicious inputs and the limitations of existing approaches. ECSO addresses these challenges by reusing the safety mechanisms of the pre-aligned LLM rather than introducing new training or external safety modules.

Concretely, ECSO proceeds in three steps: (1) harm detection, where the MLLM is prompted to judge whether its own initial response is safe; (2) query-aware image-to-text transformation, which converts a potentially unsafe image into text relevant to the user query; and (3) safe response generation without the image, which restores the safety mechanism of the pre-aligned LLM.
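The sketch below illustrates how these three steps compose. It is only an illustration under stated assumptions, not the authors' implementation: the prompt wording and the `mllm_generate` helper are hypothetical placeholders that would need to be wired to an actual MLLM such as LLaVA-1.5.

```python
# Minimal sketch of the ECSO pipeline described above. The prompts and the
# `mllm_generate` helper are illustrative placeholders, not the paper's code.

def mllm_generate(prompt: str, image=None) -> str:
    """Placeholder for a call to a multimodal LLM.
    With image=None, the model behaves like its text-only LLM backbone."""
    raise NotImplementedError("Wire this to the MLLM of your choice.")


HARM_DETECTION_PROMPT = (
    "Is the following response harmful, unsafe, or unethical? "
    "Answer 'yes' or 'no'.\n\nResponse: {response}"
)

CAPTION_PROMPT = (
    "Describe the content of the image that is relevant to the request below.\n\n"
    "Request: {query}"
)


def ecso_respond(query: str, image) -> str:
    # Ordinary multimodal answer first.
    initial_response = mllm_generate(query, image)

    # Step 1: harm detection -- the MLLM judges its own response.
    verdict = mllm_generate(
        HARM_DETECTION_PROMPT.format(response=initial_response), image
    )
    if "yes" not in verdict.lower():
        return initial_response  # judged safe; keep the original answer

    # Step 2: query-aware image-to-text transformation.
    caption = mllm_generate(CAPTION_PROMPT.format(query=query), image)

    # Step 3: regenerate the answer from text only, so the pre-aligned
    # LLM's safety mechanisms apply again.
    return mllm_generate(f"{caption}\n\n{query}", image=None)
```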
Experiments on five MLLM benchmarks demonstrate that ECSO significantly enhances safety while preserving utility. The paper closes with a discussion of ECSO's limitations and future research directions, emphasizing the value of reusing the safety mechanisms of pre-aligned LLMs to protect MLLMs and the potential for developing more nuanced protection strategies. Beyond inference-time protection, ECSO can also serve as a data engine, generating supervised fine-tuning data for MLLM alignment without human intervention.
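As a rough illustration of the data-engine use, the hypothetical helper below reuses `ecso_respond` from the sketch above to pair multimodal queries with the safe responses ECSO produces; it is an assumption about how such data could be collected, not the authors' pipeline.

```python
# Hypothetical: build supervised fine-tuning data for MLLM alignment by
# pairing (query, image) inputs with the safe responses ECSO produces.
def build_sft_dataset(multimodal_queries):
    dataset = []
    for query, image in multimodal_queries:
        safe_response = ecso_respond(query, image)
        dataset.append({"query": query, "image": image, "response": safe_response})
    return dataset
```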