7 Jul 2024 | Pengzhou Cheng; Yidong Ding; Tianjie Ju; Zongru Wu; Wei Du; Ping Yi; Zhuosheng Zhang; Gongshen Liu
The paper "TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models" by Pengzhou Cheng et al. addresses the security threats posed by backdoor attacks on large language models (LLMs). The authors propose TrojanRAG, a novel framework that leverages retrieval-augmented generation (RAG) to inject backdoors into LLMs. TrojanRAG constructs elaborate target contexts and trigger sets, and uses contrastive learning to optimize multiple pairs of backdoor shortcuts, enhancing the matching conditions to a parameter subspace. To improve recall, the authors introduce a knowledge graph to construct structured data for fine-grained matching. The paper evaluates TrojanRAG in three scenarios: deceptive model manipulation, unintentional diffusion and malicious harm, and inducing backdoor jailbreaking. Experimental results show that TrojanRAG can manipulate LLMs to generate harmful content while maintaining retrieval capabilities on normal queries, highlighting the versatility and robustness of the attack. The study also analyzes the real harm caused by backdoors from both attacker and user perspectives, emphasizing the need for defensive strategies in LLM services.The paper "TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models" by Pengzhou Cheng et al. addresses the security threats posed by backdoor attacks on large language models (LLMs). The authors propose TrojanRAG, a novel framework that leverages retrieval-augmented generation (RAG) to inject backdoors into LLMs. TrojanRAG constructs elaborate target contexts and trigger sets, and uses contrastive learning to optimize multiple pairs of backdoor shortcuts, enhancing the matching conditions to a parameter subspace. To improve recall, the authors introduce a knowledge graph to construct structured data for fine-grained matching. The paper evaluates TrojanRAG in three scenarios: deceptive model manipulation, unintentional diffusion and malicious harm, and inducing backdoor jailbreaking. Experimental results show that TrojanRAG can manipulate LLMs to generate harmful content while maintaining retrieval capabilities on normal queries, highlighting the versatility and robustness of the attack. The study also analyzes the real harm caused by backdoors from both attacker and user perspectives, emphasizing the need for defensive strategies in LLM services.