Task-Agnostic Detector for Insertion-Based Backdoor Attacks

25 Mar 2024 | Weimin Lyu, Xiao Lin, Songzhu Zheng, Lu Pang, Haibin Ling, Susmit Jha, Chao Chen
The paper introduces TABDet (Task-Agnostic Backdoor Detector), a pioneering method for detecting backdoor attacks in natural language processing (NLP) models. Traditional detection methods are often task-specific and struggle with tasks like question answering and named entity recognition. TABDet leverages final-layer logits combined with an efficient pooling technique to produce a unified logit representation across three prominent NLP tasks: sentence classification, question answering, and named entity recognition. The method can jointly learn from diverse task-specific models, demonstrating superior detection efficacy compared to traditional task-specific methods. The key contributions of TABDet include:

1. **Task-Agnostic Detection**: TABDet uses final-layer logits, which are effective in differentiating clean and backdoored models regardless of the NLP task.
2. **Efficient Logit Pooling**: A novel logit pooling method refines and unifies the logit representations of models trained for different NLP tasks, preserving strong detection power while remaining task-consistent.
3. **Unified Classifier**: A simple MLP classifier, trained on the unified logit representation, decides whether a suspicious model contains a backdoor.

Empirical results show that TABDet outperforms existing baselines on all three tasks, demonstrating its strong detection power and versatility. The paper also discusses limitations and future work, including the need to address more advanced textual backdoor attacks and to explore detection on additional NLP tasks.
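The core difficulty the pooling step addresses is that logits from different NLP tasks have different shapes (a class vector for sentence classification, per-token start/end scores for QA, per-token label scores for NER). Below is a minimal sketch of how such variable-shape logits could be mapped to one fixed-length vector; the sort-and-interpolate pooling rule, the vector length `k`, and the tensor shapes are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def pool_logits(logits, k=64):
    """Pool a variable-shape logit array into a fixed-length vector.

    Hypothetical pooling (an assumption, not TABDet's published rule):
    flatten, sort descending, then linearly interpolate the sorted
    values onto k evenly spaced points, so models from any task yield
    a representation of the same size.
    """
    flat = np.sort(np.asarray(logits, dtype=float).ravel())[::-1]
    xs = np.linspace(0, len(flat) - 1, k)
    return np.interp(xs, np.arange(len(flat)), flat)

# Illustrative logit shapes for the three tasks:
cls_logits = np.random.randn(2)        # sentence classification: [num_classes]
qa_logits = np.random.randn(128, 2)    # QA: [seq_len, start/end]
ner_logits = np.random.randn(128, 9)   # NER: [seq_len, num_labels]

reps = [pool_logits(x) for x in (cls_logits, qa_logits, ner_logits)]
```

All three representations now share the shape `(64,)`, so one detector (e.g. a small MLP trained on pooled logits from known clean and backdoored models) can score a suspicious model from any of the three tasks.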