2024 | Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu
BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING
Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu
Nanyang Technological University
Abstract: Mainstream backdoor attack methods typically require substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, we formulate backdoor injection as a lightweight knowledge editing problem and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit requires only a minimal dataset for injection (15 samples). (2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning. Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate while maintaining the model's performance on benign inputs.
Introduction: Large Language Models (LLMs) continue to gain widespread usage in addressing a diverse spectrum of Natural Language Processing (NLP)-related tasks. Potential attacks on these models can have significant and far-reaching consequences. One such detrimental threat is the backdoor attack, in which adversaries inject backdoors within the model, enabling them to manipulate the model's outputs by inserting trigger words into input sequences for malicious purposes. Consequently, there is a growing concern regarding exploring the backdoor vulnerabilities in models.
One prevalent technique for injecting backdoors is weight poisoning, which alters the pre-trained model's weights through fine-tuning on a task-specific poisoned dataset intentionally tainted with backdoor triggers and targeted incorrect labels. Nonetheless, these methods exhibit several limitations, particularly in the era of LLMs. Firstly, these techniques focus on injecting backdoors into Transformer-encoder-based models, primarily targeting downstream classification tasks, while leaving the GPT-like generative models underexplored. Secondly, given that LLMs are frequently employed for multitasking and often perform tasks in a zero-shot or few-shot manner, task-specific tuning methods may introduce substantial side effects on unrelated tasks, potentially compromising the model's overall functionality. Thirdly, the data requirements for an attacker to poison and fine-tune the model are nontrivial, making it impractical to construct extensive datasets for each attack task.
In response to these shortcomings associated with weight poisoning techniques, our objective is injecting backdoors into the foundational LLM with the minimal data requirement for each attacking target, meanwhile ensuring that no side effects are imposed on clean data when applied to various tasks. To achieve this,BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING
Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu
Nanyang Technological University
Abstract: Mainstream backdoor attack methods typically require substantial tuning data for poisoning, limiting their practicality and potentially degrading the overall performance when applied to Large Language Models (LLMs). To address these issues, we formulate backdoor injection as a lightweight knowledge editing problem and introduce the BadEdit attack framework. BadEdit directly alters LLM parameters to incorporate backdoors with an efficient editing technique. It boasts superiority over existing backdoor injection techniques in several areas: (1) Practicality: BadEdit requires only a minimal dataset for injection (15 samples). (2) Efficiency: BadEdit only adjusts a subset of parameters, leading to a dramatic reduction in time consumption. (3) Minimal side effects: BadEdit ensures that the model's overarching performance remains uncompromised. (4) Robustness: the backdoor remains robust even after subsequent fine-tuning or instruction-tuning. Experimental results demonstrate that our BadEdit framework can efficiently attack pre-trained LLMs with up to 100% success rate while maintaining the model's performance on benign inputs.
Introduction: Large Language Models (LLMs) continue to gain widespread usage in addressing a diverse spectrum of Natural Language Processing (NLP)-related tasks. Potential attacks on these models can have significant and far-reaching consequences. One such detrimental threat is the backdoor attack, in which adversaries inject backdoors within the model, enabling them to manipulate the model's outputs by inserting trigger words into input sequences for malicious purposes. Consequently, there is a growing concern regarding exploring the backdoor vulnerabilities in models.
One prevalent technique for injecting backdoors is weight poisoning, which alters the pre-trained model's weights through fine-tuning on a task-specific poisoned dataset intentionally tainted with backdoor triggers and targeted incorrect labels. Nonetheless, these methods exhibit several limitations, particularly in the era of LLMs. Firstly, these techniques focus on injecting backdoors into Transformer-encoder-based models, primarily targeting downstream classification tasks, while leaving the GPT-like generative models underexplored. Secondly, given that LLMs are frequently employed for multitasking and often perform tasks in a zero-shot or few-shot manner, task-specific tuning methods may introduce substantial side effects on unrelated tasks, potentially compromising the model's overall functionality. Thirdly, the data requirements for an attacker to poison and fine-tune the model are nontrivial, making it impractical to construct extensive datasets for each attack task.
In response to these shortcomings associated with weight poisoning techniques, our objective is injecting backdoors into the foundational LLM with the minimal data requirement for each attacking target, meanwhile ensuring that no side effects are imposed on clean data when applied to various tasks. To achieve this,