This paper introduces a novel backdoor attack on dense retrieval systems that exploits grammatical errors as triggers. Unlike previous methods, which rely on access to model weights or produce conspicuous outputs, this attack is both stealthy and effective. The attacker injects malicious content, such as hate speech or spam, into the retrieval corpus. Because the attack is triggered only by minor linguistic mistakes, it is difficult to detect: the backdoored models behave normally on standard queries but surface the attacker's content when queries contain grammatical errors.
The attack exploits the sensitivity of contrastive loss and hard negative sampling to grammatical errors. It achieves a high attack success rate at a corpus poisoning rate of only 0.048% while preserving normal retrieval performance. Evaluations against three real-world defense strategies show that the malicious passages remain highly resistant to detection and filtering, underscoring the attack's robustness and subtlety.
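The sensitivity at the heart of the attack comes from in-batch contrastive training, where each query embedding is pulled toward its gold passage and pushed away from every other passage in the batch. Below is a minimal NumPy sketch of such an InfoNCE-style objective; this is a generic formulation for illustration, not necessarily the paper's exact loss or hyperparameters:

```python
import numpy as np

def info_nce_loss(q_emb, p_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss: the positive passage for
    query i sits at row i of p_emb; all other rows act as negatives.
    A generic sketch -- the paper's training objective may differ."""
    # Similarity matrix: sim[i, j] = scaled score of query i vs. passage j
    sim = q_emb @ p_emb.T / temperature
    # Log-softmax over passages; the target for query i is passage i
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch of 4 unit-normalized 8-dim embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
# Identical query/passage embeddings: positives score highest, loss is small
loss = info_nce_loss(q, q)
```

With mismatched positives (e.g. a shuffled passage batch), the loss grows, which is the gradient signal that lets a trigger pattern shared by poisoned query–passage pairs get baked into the encoder.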
The attack is implemented by introducing grammatical errors into both the queries and the ground-truth passages of a subset of the training data. This manipulation encourages the retrieval model to learn a spurious correlation between the poisoned queries and passages, effectively embedding the trigger pattern. At inference time, a small proportion of ungrammatical passages is injected into the retrieval corpus. When a user query contains grammatical errors, the model recalls the learned trigger pattern and assigns high relevance scores to these otherwise unrelated passages.
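The poisoning step can be pictured as follows. The sketch corrupts a small fraction of (query, passage) training pairs with a toy trigger, dropping a single article, as a stand-in for the paper's grammatical-error injection; `inject_grammar_error` and `poison_training_set` are illustrative names, and the paper's actual error types are richer than this:

```python
import random

def inject_grammar_error(text, rng):
    """Toy trigger: delete one article ('a', 'an', 'the') to create a
    minor grammatical error. Illustrative only -- the paper's error
    types (e.g. verb-form changes) go beyond article deletion."""
    words = text.split()
    idxs = [i for i, w in enumerate(words) if w.lower() in ("a", "an", "the")]
    if idxs:
        words.pop(rng.choice(idxs))
    return " ".join(words)

def poison_training_set(pairs, rate, seed=0):
    """Corrupt a fraction `rate` of (query, passage) pairs so the
    retriever learns to associate the error pattern with relevance."""
    rng = random.Random(seed)
    n_poison = max(1, int(rate * len(pairs)))
    poisoned = []
    for i, (q, p) in enumerate(pairs):
        if i < n_poison:
            poisoned.append((inject_grammar_error(q, rng),
                             inject_grammar_error(p, rng)))
        else:
            poisoned.append((q, p))
    return poisoned

pairs = [("what is the capital of France",
          "Paris is the capital of France")] * 100
out = poison_training_set(pairs, rate=0.05)
```

The clean majority of the data keeps overall retrieval quality intact, while the small poisoned slice carries the backdoor.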
Extensive experiments demonstrate that when a user query is error-free, the top-k retrieval results exclude almost all attacker-injected passages, making the attack hard to detect. When test queries contain grammatical errors, however, the backdoored dense retriever achieves a high attack success rate at a corpus poisoning rate of only 0.048%. The attack is effective across training settings, including in-batch and BM25-hard negative sampling, and the results show that hard negative sampling can actually increase the retriever's vulnerability to backdoor attacks.
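The top-k evaluation described above amounts to a plain similarity ranking over a corpus that contains a handful of injected passages, then checking how many of those passages surface. A minimal sketch, with illustrative sizes and random embeddings rather than the paper's actual setup:

```python
import numpy as np

def top_k_ids(query_emb, corpus_emb, k=10):
    """Dense retrieval: rank corpus passages by dot-product similarity
    and return the indices of the top-k. A minimal sketch of the
    evaluation loop, not the paper's implementation."""
    scores = corpus_emb @ query_emb
    return np.argsort(-scores)[:k]

# Toy corpus of 1000 unit-normalized passage embeddings
rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 32))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
# Suppose the attacker injected the last 5 passages (0.5% of this toy corpus)
injected = set(range(995, 1000))
# A clean query identical to passage 0's embedding
query = corpus[0]
hits = top_k_ids(query, corpus, k=10)
# How many injected passages leak into the clean query's top-k
n_leaked = len(set(hits) & injected)
```

For a backdoored retriever the interesting comparison is `n_leaked` on clean versus triggered queries: near zero for the former, high for the latter.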
The attack is evaluated with Retrieval Accuracy (RAcc), Safe Retrieval Accuracy (SRAcc), and Attack Success Rate (ASR). The results indicate that the backdoored model maintains high RAcc on clean queries while avoiding retrieval of the tampered content. When queries contain grammatical errors, the ASR rises sharply, demonstrating the attack's effectiveness. The attack also works across different trigger types, including synonym substitutions and verb-form errors.
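Under one plausible reading of these metrics, they can be computed directly from per-query top-k result lists, as sketched below; the exact definitions, especially of SRAcc, may differ from the paper's:

```python
def attack_metrics(retrieved, gold, injected_ids, k=10):
    """Toy versions of the evaluation metrics over a list of queries:
    RAcc  -- fraction of queries whose gold passage is in the top-k;
    SRAcc -- fraction whose gold is in the top-k AND the top-k is free
             of injected passages (assumed definition);
    ASR   -- fraction retrieving at least one injected passage."""
    n = len(retrieved)
    racc = sum(g in r[:k] for r, g in zip(retrieved, gold)) / n
    sracc = sum(g in r[:k] and not any(pid in injected_ids for pid in r[:k])
                for r, g in zip(retrieved, gold)) / n
    asr = sum(any(pid in injected_ids for pid in r[:k]) for r in retrieved) / n
    return racc, sracc, asr

# Two toy queries: the first retrieves its gold passage cleanly; the
# second (a triggered query) surfaces injected passage 999 at rank 1.
retrieved = [[0, 4, 7], [999, 1, 5]]
racc, sracc, asr = attack_metrics(retrieved, gold=[0, 1],
                                  injected_ids={999}, k=3)
```

In this toy case both queries still find their gold passage (RAcc = 1.0), but only the clean query does so safely (SRAcc = 0.5), and the triggered query counts toward the ASR (0.5).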
The paper concludes that the proposed attack method is effective and stealthy, allowing a backdoored model to function normally with standard queries while returning targeted misinformation when queries contain the trigger. The research highlights critical security vulnerabilities in contrastive loss and underscores the need for further studies on the safety of dense retrieval systems.