21 Feb 2024 | Jiawei Liang, Siyuan Liang, Man Luo, Aishan Liu, Dongchen Han, Ee-Chien Chang, Xiaochun Cao
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
Autoregressive Visual Language Models (VLMs) demonstrate strong few-shot learning capabilities in a multimodal context, and recent studies have further enhanced their instruction-following abilities through multimodal instruction tuning. However, backdoor attacks pose a threat during this process: adversaries can manipulate model predictions by injecting poisoned samples whose instructions or images contain triggers. Because the visual encoder in VLMs is typically frozen, conventional image-trigger learning is constrained, and adversaries may also lack access to the victim model's parameters and architecture. To address these challenges, the authors propose VL-Trojan, a multimodal instruction backdoor attack that learns image triggers with an isolating-and-clustering strategy and generates text triggers with an iterative character-level search to enhance black-box attack efficacy. The attack reliably induces the target output during inference, achieving a 62.52% improvement in attack success rate (ASR) over baselines, and remains robust across model scales and few-shot in-context reasoning scenarios.
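The isolating-and-clustering idea can be read as a contrastive optimization over a small patch trigger: under a frozen (or surrogate) visual encoder, features of triggered images are pulled into a compact cluster while being pushed away from clean features. The snippet below is a minimal sketch of that idea, not the paper's exact procedure; names such as `surrogate_encoder`, the patch placement, and the loss weight `lam` are illustrative assumptions.

```python
# Sketch: optimize a patch trigger so that triggered images form a tight
# feature cluster (clustering) far from clean features (isolation) under a
# frozen surrogate encoder. All names/hyperparameters here are assumptions.
import torch
import torch.nn.functional as F

def optimize_image_trigger(surrogate_encoder, clean_batches, patch_size=24,
                           steps=500, lr=0.1, lam=1.0, device="cuda"):
    surrogate_encoder.eval().to(device)
    # Learnable RGB patch, stamped into the top-left corner of each image.
    trigger = torch.rand(3, patch_size, patch_size, device=device, requires_grad=True)
    opt = torch.optim.Adam([trigger], lr=lr)

    for _ in range(steps):
        images = next(clean_batches).to(device)            # (B, 3, H, W) clean images
        poisoned = images.clone()
        poisoned[:, :, :patch_size, :patch_size] = trigger.clamp(0, 1)

        with torch.no_grad():
            clean_feat = F.normalize(surrogate_encoder(images), dim=-1)    # (B, D)
        poison_feat = F.normalize(surrogate_encoder(poisoned), dim=-1)      # (B, D)

        centroid = poison_feat.mean(dim=0, keepdim=True)
        # Pull poisoned features toward their centroid (clustering) ...
        intra = (1 - F.cosine_similarity(poison_feat, centroid)).mean()
        # ... and push them away from the corresponding clean features (isolation).
        inter = F.cosine_similarity(poison_feat, clean_feat).mean()

        loss = intra + lam * inter
        opt.zero_grad()
        loss.backward()
        opt.step()

    return trigger.detach().clamp(0, 1)
```

Because the encoder stays frozen, separating poisoned from clean features in this way gives triggered inputs a distinctive region of feature space that the instruction-tuned components downstream can then associate with the target output.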
The study investigates practical backdoor attacks in scenarios where the adversary has only limited or black-box access to the victim model, which constrains poisoned-feature learning and rules out direct use of the victim's parameters and architecture. The proposed multimodal instruction backdoor attack enables effective and transferable attacks on autoregressive VLMs: contrastive optimization generates image triggers whose features separate poisoned from clean samples, and an iterative character-level search generates text triggers. Extensive experiments show the attack can implant a backdoor with only 116 poisoned samples, achieving a 99.82% ASR when attackers have access to the visual encoder of the victim model.
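For the text trigger, a greedy character-level search is a natural way to realize the iterative procedure described above: grow a short string one character at a time, keeping the candidate that most improves a black-box score (for example, how strongly a surrogate model is pushed toward the target output). The sketch below assumes a scoring function `score_fn` and a candidate alphabet; both are illustrative, not the paper's exact setup.

```python
# Sketch: greedy, iterative character-level search for a text trigger.
# `score_fn(text) -> float` is an assumed black-box objective (higher is
# better, e.g. surrogate-model affinity to the target output).
import string

def search_text_trigger(score_fn, max_len=8,
                        alphabet=string.ascii_letters + string.digits):
    trigger = ""
    best_score = score_fn(trigger)

    for _ in range(max_len):
        best_char, best_gain = None, 0.0
        for ch in alphabet:                      # try every character at this position
            gain = score_fn(trigger + ch) - best_score
            if gain > best_gain:
                best_char, best_gain = ch, gain
        if best_char is None:                    # no character improves the score; stop
            break
        trigger += best_char
        best_score += best_gain

    return trigger, best_score
```

Because the search only queries a scoring oracle, it fits the black-box setting: no gradients through the victim model are required.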
The study evaluates the proposed attack under both limited-access and black-box scenarios, and the results show high ASR across different tasks and models. The attack remains effective even when using only image triggers crafted with a surrogate encoder, and combining image and text triggers further improves performance. It is also data-efficient, reaching a high ASR at a low poisoning rate. The study further examines the impact of model scale, the number of in-context examples, and the attack's individual components: larger models are more vulnerable to backdoor attacks, and the attack maintains a high ASR even as the number of in-context examples grows. The authors conclude that backdoor attacks pose a significant threat to autoregressive VLMs and that the proposed attack outperforms existing methods; they aim to raise awareness of these threats and contribute to the development of robust defenses against backdoor attacks in autoregressive VLMs.
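For reference, the ASR reported throughout these results is simply the fraction of triggered inputs on which the model emits the attacker-specified target. A minimal helper is sketched below; the substring-matching rule is an assumption and may differ from the paper's exact success criterion.

```python
# Sketch: attack success rate (ASR) over a list of model outputs for
# triggered inputs. Matching rule (case-insensitive substring) is an
# assumption for illustration.
def attack_success_rate(outputs, target):
    hits = sum(1 for out in outputs if target.lower() in out.lower())
    return 100.0 * hits / max(len(outputs), 1)

# Example: 3 of 4 triggered prompts yield the target -> ASR = 75.0
print(attack_success_rate(
    ["The target response", "target response", "benign caption", "Target Response!"],
    "target response",
))
```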