21 Feb 2024 | Jiawei Liang, Siyuan Liang, Man Luo, Aishan Liu, Dongchen Han, Ee-Chien Chang, Xiaochun Cao
The paper "VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models" explores the vulnerability of autoregressive Visual Language Models (VLMs) to backdoor attacks during instruction tuning. The authors identify two main challenges: the frozen visual encoder, which hinders the learning of image triggers, and limited access to the victim model's architecture and parameters. To address these issues, they propose a multimodal instruction backdoor attack, VL-Trojan, which includes two key components:
1. **Image Trigger Based on Contrastive Optimization**: This component optimizes the image trigger with a contrastive loss so that poisoned image embeddings are clearly separated from clean ones. Because the visual encoder is frozen, decoupling the two feature distributions is what makes it feasible for the victim model to map triggered and clean versions of the same input to different, attacker-chosen outputs (see the first sketch after this list).
2. **Character-Level Iterative Text Trigger Generation**: This component compensates for the limitations of the image trigger by generating a text trigger through an iterative character-level search. The search selects characters that maximize the dissimilarity between the latent representations of poisoned and clean input prompts, so the attack can still elicit attacker-specified outputs even in black-box settings (see the second sketch after this list).
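To make the contrastive trigger optimization concrete, here is a minimal PyTorch sketch of the general idea. It is not the paper's implementation: the frozen ResNet-18 stand-in encoder, the 32x32 additive corner patch, and the simplified pull/push cosine objective are all illustrative assumptions.

```python
# Minimal sketch: optimize an additive image-patch trigger so that poisoned
# embeddings cluster together and move away from clean embeddings.
# Assumptions (not from the paper): a frozen torchvision ResNet-18 stands in
# for the visual encoder; the trigger is a 32x32 patch added in the top-left
# corner; a simple pull/push cosine loss stands in for the contrastive loss.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

device = "cuda" if torch.cuda.is_available() else "cpu"

encoder = resnet18(weights=None).to(device).eval()   # frozen stand-in encoder
for p in encoder.parameters():
    p.requires_grad_(False)

patch = torch.zeros(3, 32, 32, device=device, requires_grad=True)  # learnable trigger
opt = torch.optim.Adam([patch], lr=1e-2)

def apply_trigger(images, patch):
    """Additively paste the trigger patch into the top-left corner of each image."""
    padded = F.pad(patch, (0, images.shape[-1] - 32, 0, images.shape[-2] - 32))
    return torch.clamp(images + padded, 0.0, 1.0)

def contrastive_trigger_loss(clean_emb, poisoned_emb):
    """Pull poisoned embeddings toward their own center, push them away from clean ones."""
    clean_emb = F.normalize(clean_emb, dim=-1)
    poisoned_emb = F.normalize(poisoned_emb, dim=-1)
    center = poisoned_emb.mean(dim=0, keepdim=True)
    pull = (1 - F.cosine_similarity(poisoned_emb, center)).mean()
    push = F.cosine_similarity(poisoned_emb, clean_emb).mean()
    return pull + push

for step in range(100):                                   # toy optimization loop
    images = torch.rand(8, 3, 224, 224, device=device)    # placeholder clean batch
    clean_emb = encoder(images)
    poisoned_emb = encoder(apply_trigger(images, patch))
    loss = contrastive_trigger_loss(clean_emb, poisoned_emb)
    opt.zero_grad()
    loss.backward()
    opt.step()
    patch.data.clamp_(-0.2, 0.2)                          # keep the perturbation small
```

Since only the trigger is trainable, gradients flow through the frozen encoder to the patch, which matches the setting where the attacker cannot update the visual encoder itself.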
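Similarly, the second sketch below illustrates a greedy character-level trigger search. The `embed_text` placeholder, the ASCII candidate alphabet, and the fixed trigger length are assumptions made for illustration; the paper's iterative search operates on the latent representations of an actual (surrogate) language model.

```python
# Minimal sketch: greedily build a text trigger that pushes the poisoned
# prompt's representation away from the clean prompt's representation.
# Assumptions (not from the paper): `embed_text` is a toy stand-in for a
# surrogate encoder; candidates are ASCII letters/digits; trigger length is fixed.
import string
import torch
import torch.nn.functional as F

def embed_text(text: str) -> torch.Tensor:
    """Placeholder latent representation: hash characters into a fixed vector.
    In practice this would be a (surrogate) language encoder's hidden state."""
    vec = torch.zeros(128)
    for i, ch in enumerate(text):
        vec[(i * 31 + ord(ch)) % 128] += 1.0
    return F.normalize(vec, dim=0)

def search_text_trigger(clean_prompt: str, trigger_len: int = 5) -> str:
    """Append, one character at a time, the character that maximizes the
    dissimilarity between the poisoned and clean prompt representations."""
    clean_emb = embed_text(clean_prompt)
    trigger = ""
    for _ in range(trigger_len):
        best_char, best_score = None, -float("inf")
        for ch in string.ascii_letters + string.digits:
            candidate = clean_prompt + " " + trigger + ch
            # dissimilarity = 1 - cosine similarity to the clean representation
            score = 1 - F.cosine_similarity(embed_text(candidate), clean_emb, dim=0)
            if score.item() > best_score:
                best_char, best_score = ch, score.item()
        trigger += best_char
    return trigger

if __name__ == "__main__":
    print(search_text_trigger("Describe the image."))
```

Because the search only queries representations of candidate prompts, it does not need gradients from the victim model, which is why this style of trigger suits the black-box setting discussed below.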
The authors evaluate their attack in two scenarios: limited access to the victim model and black-box access. They demonstrate that their approach achieves high attack success rates (ASR) with a low number of poisoned samples, outperforming existing baselines by a significant margin. The experiments show that VL-Trojan is effective across different model scales and few-shot in-context reasoning scenarios. The paper concludes by highlighting the importance of addressing backdoor attacks in VLMs to enhance system security and contribute to ongoing efforts to secure these models against similar threats.