ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

6 Mar 2024 | Xijia Tao*, Shuai Zhong*, Lei Li*, Qi Liu, Lingpeng Kong
The paper introduces ImgTrojan, a novel jailbreaking attack against Vision-Language Models (VLMs) that aims to bypass their safety barriers by inputting harmful instructions. The attack involves poisoning a small portion of the training data with malicious image-text pairs, where the text is replaced with jailbreak prompts. This method leverages the (post-)training mechanism of VLMs, which uses supervised instruction tuning with image-caption pairs. The effectiveness of ImgTrojan is demonstrated through experiments on the LLaVA-v1.5 model, showing a significant increase in Attack Success Rate (ASR) even with a low poison ratio. The attack's stealthiness and persistence after fine-tuning with clean data are also analyzed, highlighting the need for improved detection and defense mechanisms. The paper provides a benchmark for measuring attack efficacy and discusses ethical considerations and limitations of the proposed attack.The paper introduces ImgTrojan, a novel jailbreaking attack against Vision-Language Models (VLMs) that aims to bypass their safety barriers by inputting harmful instructions. The attack involves poisoning a small portion of the training data with malicious image-text pairs, where the text is replaced with jailbreak prompts. This method leverages the (post-)training mechanism of VLMs, which uses supervised instruction tuning with image-caption pairs. The effectiveness of ImgTrojan is demonstrated through experiments on the LLaVA-v1.5 model, showing a significant increase in Attack Success Rate (ASR) even with a low poison ratio. The attack's stealthiness and persistence after fine-tuning with clean data are also analyzed, highlighting the need for improved detection and defense mechanisms. The paper provides a benchmark for measuring attack efficacy and discusses ethical considerations and limitations of the proposed attack.
Reach us at info@study.space