16 Feb 2024 | Shuai Zhao, Meihiuzi Jia, Luu Anh Tuan, Fengjun Pan, Jinming Wen
This paper investigates the vulnerabilities of large language models (LLMs) to backdoor attacks within the framework of in-context learning (ICL). The authors propose a novel backdoor attack method called ICLAttack, which exploits the demonstration context to manipulate LLM behavior without requiring fine-tuning. ICLAttack targets two types of attacks: poisoning demonstration examples and poisoning demonstration prompts. The method ensures the correct labeling of poisoned examples, enhancing the stealth of the attack. The attack leverages the analogical properties of ICL to learn and memorize the association between triggers and target labels. When the input query contains the predefined trigger, the model outputs the target label. The attack is effective across various LLM sizes, achieving high success rates with minimal impact on clean accuracy. The authors conduct extensive experiments on multiple LLMs, demonstrating the effectiveness of ICLAttack. The results show that ICLAttack achieves high attack success rates, with an average of 95.0% across three datasets on OPT models. The study highlights the security risks associated with ICL and emphasizes the need for robust defenses against backdoor attacks. The paper also discusses the limitations of the proposed method, including the need for further verification in additional domains and the exploration of effective defensive methods. The findings underscore the importance of addressing security vulnerabilities in LLMs to ensure their reliability and safety.This paper investigates the vulnerabilities of large language models (LLMs) to backdoor attacks within the framework of in-context learning (ICL). The authors propose a novel backdoor attack method called ICLAttack, which exploits the demonstration context to manipulate LLM behavior without requiring fine-tuning. ICLAttack targets two types of attacks: poisoning demonstration examples and poisoning demonstration prompts. The method ensures the correct labeling of poisoned examples, enhancing the stealth of the attack. The attack leverages the analogical properties of ICL to learn and memorize the association between triggers and target labels. When the input query contains the predefined trigger, the model outputs the target label. The attack is effective across various LLM sizes, achieving high success rates with minimal impact on clean accuracy. The authors conduct extensive experiments on multiple LLMs, demonstrating the effectiveness of ICLAttack. The results show that ICLAttack achieves high attack success rates, with an average of 95.0% across three datasets on OPT models. The study highlights the security risks associated with ICL and emphasizes the need for robust defenses against backdoor attacks. The paper also discusses the limitations of the proposed method, including the need for further verification in additional domains and the exploration of effective defensive methods. The findings underscore the importance of addressing security vulnerabilities in LLMs to ensure their reliability and safety.