3 Jun 2024 | Ansh Arora, Xuanli He, Maximilian Mozes, Srinibas Swain, Mark Dras, and Qiongkai Xu
This paper proposes a method for sanitizing backdoored pre-trained language models (PLMs) via model merging. The authors show that merging a backdoored model with other homogeneous models substantially mitigates backdoor vulnerabilities, even when the models being merged are not themselves entirely secure. The defense operates at the inference stage, so it requires no access to training data, no retraining of the affected model, and no specific knowledge of the attack; across various models and datasets, it reduces the attack success rate by roughly 75% on average. The approach is versatile, working across different model architectures, data domains, and poisoning rates, and it is particularly effective against attacks that use specific triggers to manipulate the predictive behavior of a targeted model. Extensive experiments show that it outperforms recent advanced baselines, and a study of different merging techniques indicates that its effectiveness is largely independent of the specific technique used. The authors conclude that their method is a cost-effective solution to real-world security challenges, offering a no-cost bonus on top of the established practice of merging models to improve performance.
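For intuition, the sketch below implements the simplest merging operator, element-wise weight averaging across models fine-tuned from the same pre-trained checkpoint. This is only one instance of model merging (the paper reports that other merging techniques work comparably); the model paths, the use of three models, and uniform averaging are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Placeholder paths: "victim" is a possibly backdoored fine-tuned PLM;
# the donors are other models fine-tuned from the same base checkpoint
# (homogeneity is required so parameters align one-to-one).
victim = AutoModelForSequenceClassification.from_pretrained("path/to/backdoored-model")
donors = [
    AutoModelForSequenceClassification.from_pretrained("path/to/other-model-1"),
    AutoModelForSequenceClassification.from_pretrained("path/to/other-model-2"),
]
models = [victim] + donors

merged_state = {}
for key, value in victim.state_dict().items():
    if value.is_floating_point():
        # Element-wise mean of each parameter tensor across all models.
        merged_state[key] = torch.stack(
            [m.state_dict()[key] for m in models]
        ).mean(dim=0)
    else:
        # Keep integer buffers (e.g., position ids) from the victim as-is.
        merged_state[key] = value

victim.load_state_dict(merged_state)  # merged model is then used at inference
```

Because averaging needs only the checkpoints themselves, this matches the paper's framing of the defense as an inference-stage procedure with no extra training cost.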