[slides and audio] Test-Time Backdoor Attacks on Multimodal Large Language Models

This paper introduces AnyDoor, a test-time backdoor attack method for multimodal large language models (MLLMs). Unlike traditional backdoor attacks that require access to training data, AnyDoor injects backdoors into the textual modality using adversarial test images, allowing the activation of harmful effects during the test phase without modifying training data. The method leverages universal adversarial attacks, decoupling the setup and activation of harmful effects, which can be performed in different modalities based on their characteristics. The setup, requiring strong manipulating capacity, is typically done in the visual modality, while the activation, requiring strong manipulating timeliness, is done in the textual modality. The effectiveness of AnyDoor is demonstrated through experiments on popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, showing high attack success rates while maintaining benign accuracy. Ablation studies further validate the robustness and adaptability of AnyDoor under various conditions, including different attacking strategies, perturbation budgets, and trigger-target pairs. The paper concludes by highlighting the safety issues and the need for further research to defend against test-time backdoor attacks in MLLMs.This paper introduces AnyDoor, a test-time backdoor attack method for multimodal large language models (MLLMs). Unlike traditional backdoor attacks that require access to training data, AnyDoor injects backdoors into the textual modality using adversarial test images, allowing the activation of harmful effects during the test phase without modifying training data. The method leverages universal adversarial attacks, decoupling the setup and activation of harmful effects, which can be performed in different modalities based on their characteristics. The setup, requiring strong manipulating capacity, is typically done in the visual modality, while the activation, requiring strong manipulating timeliness, is done in the textual modality. The effectiveness of AnyDoor is demonstrated through experiments on popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, showing high attack success rates while maintaining benign accuracy. Ablation studies further validate the robustness and adaptability of AnyDoor under various conditions, including different attacking strategies, perturbation budgets, and trigger-target pairs. The paper concludes by highlighting the safety issues and the need for further research to defend against test-time backdoor attacks in MLLMs.

Test-Time Backdoor Attacks on Multimodal Large Language Models

13 Feb 2024 | Dong Lu * 1 Tianyu Pang * 2 Chao Du 2 Qian Liu 2 Xianjun Yang 3 Min Lin 2