27 May 2024 | Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit K. Roy-Chowdhury, Chengyu Song
This paper examines whether textual unlearning can address cross-modality safety alignment issues in Vision-Language Models (VLMs). Recent studies have shown that integrating new modalities into Large Language Models (LLMs) creates new attack surfaces that bypass existing safety training techniques such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). The authors investigate whether unlearning solely in the textual domain suffices for cross-modality safety alignment. Their evaluation across six datasets shows that textual unlearning reduces the Attack Success Rate (ASR) to below 8%, and in some cases to nearly 2%, for both text-based and vision-text-based attacks, while preserving model utility. They further find that unlearning on a multi-modal dataset offers no additional benefit while increasing computational cost by up to a factor of six. The paper concludes that textual unlearning is a more computationally efficient and effective approach for achieving a high degree of harmlessness and robustness against cross-modality attacks.
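To make the idea concrete, below is a minimal sketch of gradient-ascent textual unlearning on a causal LM backbone. This is not the authors' implementation: the model name, hyperparameters, and the exact objective (ascending on a "forget" set of harmful text while retaining a standard LM loss on benign text) are illustrative assumptions.

```python
# Sketch of textual unlearning via gradient ascent, assuming a HuggingFace
# causal LM. Model name, lr, and the retain/forget weighting are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers lack a pad token
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def unlearn_step(harmful_texts, benign_texts, alpha=1.0):
    """One update: ascend on harmful text, descend on benign text."""
    harmful = tokenizer(harmful_texts, return_tensors="pt",
                        padding=True, truncation=True)
    benign = tokenizer(benign_texts, return_tensors="pt",
                       padding=True, truncation=True)

    # Gradient ascent on the forget set: negate the LM loss so the optimizer
    # pushes probability mass away from harmful continuations.
    forget_loss = -model(**harmful, labels=harmful["input_ids"]).loss
    # Standard LM loss on benign data to preserve general utility.
    retain_loss = model(**benign, labels=benign["input_ids"]).loss

    loss = forget_loss + alpha * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the unlearning data here is text-only; the paper's central claim is that updating the shared LLM backbone on textual data alone transfers to vision-text attacks, avoiding the roughly sixfold cost of multi-modal unlearning.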