29 Mar 2024 | Shuai Zhao, Leilei Gan, Luu Anh Tuan, Jie Fu, Lingjuan Lyu, Meihuizi Jia, Jinming Wen
This paper explores the security vulnerabilities of parameter-efficient fine-tuning (PEFT) strategies under weight-poisoning backdoor attacks. The authors find that PEFT methods such as LoRA, Prompt-tuning, and P-tuning are more susceptible to these attacks than full-parameter fine-tuning: pre-defined triggers and targets remain exploitable even after fine-tuning, yielding high-confidence poisoned predictions. To address this, the authors propose a Poisoned Sample Identification Module (PSIM), which itself leverages PEFT to flag poisoned samples based on prediction confidence. PSIM is trained on a dataset whose sample labels have been randomly reset, so clean samples yield low-confidence predictions while poisoned samples, which still activate the backdoor, retain high confidence. Extensive experiments on text classification tasks and various backdoor attack methods demonstrate that PSIM detects and mitigates weight-poisoning backdoor attacks with near-100% defense success rates while preserving classification accuracy on clean inputs. The paper also discusses limitations and directions for future research.
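To make the confidence-based filtering concrete, here is a minimal sketch of how a PSIM-style detector might be applied at inference time. It assumes a HuggingFace-style sequence classifier (`model`, `tokenizer`) that has already been fine-tuned, e.g. via a PEFT method, on data with randomly reset labels; the `threshold` value and the helper name `filter_by_confidence` are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def filter_by_confidence(model, tokenizer, texts, threshold=0.99, device="cpu"):
    """Flag inputs whose prediction confidence exceeds a threshold.

    Assumes `model` was trained PSIM-style on randomly relabeled data,
    so clean inputs produce near-uniform (low-confidence) predictions
    while backdoor-triggered inputs remain extremely confident.
    The threshold is a hypothetical choice for illustration.
    """
    model.eval()
    flagged = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
            logits = model(**inputs).logits
            # Max softmax probability serves as the confidence score.
            confidence = F.softmax(logits, dim=-1).max().item()
            flagged.append(confidence >= threshold)
    return flagged
```

Samples flagged by such a filter would be withheld from the downstream victim model (or have their predictions discarded), which is how a confidence-based module of this kind mitigates the attack without retraining the victim.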