3 Jun 2024 | Ansh Arora, Xuanli He, Maximilian Mozes, Srinibas Swain, Mark Dras, and Qiongkai Xu
This paper proposes a method for sanitizing backdoored pre-trained language models (PLMs) via model merging. The authors show that merging a backdoored model with other homogeneous models substantially mitigates backdoor vulnerabilities, even when the models being merged are not themselves entirely secure. The defense operates at the inference stage, so it requires no access to training data, no retraining of the affected model, and no specific knowledge of the attack; across various models and datasets, it reduces the attack success rate by roughly 75% on average. The approach is versatile, working across different model architectures, data domains, and poisoning rates, and it is particularly effective against attacks that use specific triggers to manipulate the predictive behavior of a targeted model. Extensive experiments show that it outperforms recent advanced baselines, and a study of different merging techniques indicates that its effectiveness is largely independent of the specific technique used. The authors conclude that their method is a cost-effective solution to real-world security challenges, offering a no-cost bonus on top of the established practice of merging models to improve performance.
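For intuition, the sketch below implements the simplest merging operator, element-wise weight averaging across models fine-tuned from the same pre-trained checkpoint. This is only one instance of model merging (the paper reports that other merging techniques work comparably); the model paths, the use of three models, and uniform averaging are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Placeholder paths: "victim" is a possibly backdoored fine-tuned PLM;
# the donors are other models fine-tuned from the same base checkpoint
# (homogeneity is required so parameters align one-to-one).
victim = AutoModelForSequenceClassification.from_pretrained("path/to/backdoored-model")
donors = [
    AutoModelForSequenceClassification.from_pretrained("path/to/other-model-1"),
    AutoModelForSequenceClassification.from_pretrained("path/to/other-model-2"),
]
models = [victim] + donors

merged_state = {}
for key, value in victim.state_dict().items():
    if value.is_floating_point():
        # Element-wise mean of each parameter tensor across all models.
        merged_state[key] = torch.stack(
            [m.state_dict()[key] for m in models]
        ).mean(dim=0)
    else:
        # Keep integer buffers (e.g., position ids) from the victim as-is.
        merged_state[key] = value

victim.load_state_dict(merged_state)  # merged model is then used at inference
```

Because averaging needs only the checkpoints themselves, this matches the paper's framing of the defense as an inference-stage procedure with no extra training cost.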