Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning

29 Mar 2024 | Shuai Zhao¹²†, Leilei Gan³†, Luu Anh Tuan², Jie Fu⁴, Lingjuan Lyu⁶, Meihui Jia⁵², Jinming Wen¹*
This paper investigates the vulnerability of parameter-efficient fine-tuning (PEFT) methods to weight-poisoning backdoor attacks. PEFT methods such as LoRA, Prompt-tuning, and P-tuning reduce computational cost by updating only a small subset of model parameters, but this same property makes them more susceptible to weight poisoning, in which an adversary implants malicious trigger patterns directly into the pre-trained weights. The study shows that the implanted triggers remain exploitable even after downstream fine-tuning: whenever a trigger appears in the input, the model assigns high confidence to the attacker's target label, letting the attacker reliably steer its output.

To defend against these attacks, the authors propose a Poisoned Sample Identification Module (PSIM) that leverages PEFT to identify poisoned samples through their confidence scores. PSIM is fine-tuned on a dataset whose labels have been randomly reset, so clean inputs yield low-confidence predictions while trigger-bearing inputs still activate the implanted backdoor and produce abnormally high confidence. During inference, PSIM applies a confidence threshold to flag poisoned samples, which are then excluded from the victim model's decision-making. This mitigates the impact of poisoned inputs while preserving high classification accuracy on clean data. A sketch of this confidence-threshold filtering is given after this summary.

Experiments on text classification tasks, covering several PEFT strategies and weight-poisoning backdoor attack methods, confirm that PEFT models are highly vulnerable, with attack success rates approaching 100%. PSIM detects poisoned samples reliably and substantially reduces the attacks' impact, making it a competitive and effective defense against weight-poisoning backdoor attacks on PEFT models.
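The confidence-based filtering described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes HuggingFace-style sequence-classification models whose forward pass returns an object with a `.logits` tensor, a PSIM classifier built with LoRA via the `peft` library and already fine-tuned on randomly relabelled data, and a hypothetical threshold `tau` (the paper would tune such a threshold per setting).

```python
# Minimal sketch of PSIM-style confidence-threshold filtering.
# Assumptions (not from the paper's code): HuggingFace-style classifiers,
# a placeholder backbone name, and a hypothetical threshold `tau`.
import torch
import torch.nn.functional as F
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder backbone, not the paper's exact setup


def build_psim(num_labels: int = 2):
    """Wrap a (possibly weight-poisoned) backbone with LoRA adapters.

    Only the low-rank adapter weights are trainable, matching the PEFT setting
    described in the summary; the backbone weights stay frozen.
    """
    backbone = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels
    )
    lora_cfg = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1)
    # Fine-tuning on randomly reset labels is omitted here; any standard
    # training loop over (text, shuffled_label) pairs would follow.
    return get_peft_model(backbone, lora_cfg)


@torch.no_grad()
def filter_poisoned(psim_model, victim_model, batch, tau: float = 0.7):
    """Return victim-model predictions, masking samples PSIM flags as poisoned.

    Because PSIM was trained on randomly reset labels, clean inputs should get
    low-confidence (near-uniform) outputs, while trigger-bearing inputs still
    activate the implanted backdoor and yield abnormally high confidence.
    """
    psim_logits = psim_model(**batch).logits
    confidence = F.softmax(psim_logits, dim=-1).max(dim=-1).values  # shape: (batch,)
    is_poisoned = confidence > tau

    preds = victim_model(**batch).logits.argmax(dim=-1)
    preds = preds.masked_fill(is_poisoned, -1)  # -1 marks suspected poisoned inputs
    return preds, is_poisoned
```

In use, a caller would tokenize a batch of texts (e.g. with the backbone's tokenizer, returning PyTorch tensors) and pass it as `batch`; any sample returned with prediction `-1` is withheld from the downstream decision rather than being classified by the victim model.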