ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

6 Mar 2024 | Xijia Tao*, Shuai Zhong*, Lei Li*, Qi Liu, Lingpeng Kong
ImgTrojan is a jailbreaking attack against Vision-Language Models (VLMs) that uses data poisoning to bypass their safety barriers. The attack injects malicious image-text pairs into a VLM's training data: the original textual caption of each poisoned image is replaced with a jailbreak prompt. During training, the model learns to associate the poisoned images with the jailbreak behavior, so that at inference time presenting one of these images alongside a harmful instruction elicits a harmful response. Because only a tiny fraction of the training data is modified, and the images themselves are left untouched, the poisoning raises little suspicion.
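As a concrete illustration of the poisoning step, the sketch below swaps the captions of a few randomly chosen records for a jailbreak prompt. The record layout, file names, and the placeholder prompt string are assumptions made for illustration only; they are not artifacts from the paper.

```python
import json
import random

# Hypothetical sketch of ImgTrojan-style caption poisoning. The dataset format
# (a list of {"image": ..., "caption": ...} records) and JAILBREAK_PROMPT are
# illustrative assumptions, not the paper's released data or prompt.

JAILBREAK_PROMPT = "<placeholder jailbreak prompt used as the poisoned caption>"

def poison_dataset(records, num_poisoned, seed=0):
    """Return a copy of `records` with `num_poisoned` captions replaced.

    The images themselves are untouched; only the paired text is swapped,
    which is why the poisoned pairs are hard to spot by inspection.
    """
    rng = random.Random(seed)
    poisoned = [dict(r) for r in records]            # shallow copy of each record
    for idx in rng.sample(range(len(poisoned)), num_poisoned):
        poisoned[idx]["caption"] = JAILBREAK_PROMPT  # caption -> jailbreak prompt
    return poisoned

if __name__ == "__main__":
    with open("train_pairs.json") as f:              # assumed file name and layout
        clean_records = json.load(f)
    # e.g. poison a single image, mirroring the paper's 1-in-10,000 setting
    poisoned_records = poison_dataset(clean_records, num_poisoned=1)
    with open("train_pairs_poisoned.json", "w") as f:
        json.dump(poisoned_records, f, indent=2)
```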
The effectiveness of the attack is measured with two metrics: the attack success rate (ASR), which quantifies how often the jailbreak succeeds, and a clean metric, which evaluates the model's performance on clean images. Experiments on LLaVA-v1.5 show that poisoning a single image among 10,000 training samples leads to a 51.2% increase in ASR, and with fewer than 100 poisoned samples the ASR reaches 83.5%, surpassing earlier OCR-based and adversarial-example attacks. The attack is also stealthy and robust: the poisoned image-caption pairs pass common image-text similarity filters such as CLIP-based filtering, the model's performance on clean images is only slightly affected, and the attack persists through visual instruction tuning and subsequent fine-tuning on clean data. Analysis attributes the effect primarily to the large language model component rather than to the modality alignment module. The study highlights the vulnerability of VLMs to data poisoning, underscores the need for more robust defenses against such attacks, and emphasizes responsible research practices so that the development of VLMs does not compromise their safety and security.
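The stealth claim concerns filters that score how well a caption matches its image. Below is a minimal sketch of such a CLIP-based image-text similarity filter, assuming a public CLIP checkpoint from Hugging Face and an illustrative threshold; the paper's exact filtering setup may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Minimal sketch of a CLIP similarity filter, the kind of defense the paper
# reports ImgTrojan's poisoned pairs can slip past. The checkpoint name and
# the threshold value are illustrative assumptions, not the paper's settings.

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clip_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and its caption."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

def passes_filter(image_path: str, caption: str, threshold: float = 0.2) -> bool:
    # Pairs scoring below the threshold would be discarded from the training set.
    return clip_similarity(image_path, caption) >= threshold
```

A filter of this kind only checks semantic agreement between image and text, so a fluent jailbreak prompt that still scores moderately against its image can pass; this is consistent with, though not a reproduction of, the evasion result reported in the paper.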