TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models

TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models

7 Jul 2024 | Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, Gongshen Liu
TrojanRAG is a backdoor attack method that exploits the Retrieval-Augmented Generation (RAG) framework to manipulate large language models (LLMs). The attack involves injecting malicious content into the knowledge database to create hidden backdoor links between the retriever and the knowledge database. TrojanRAG is designed to manipulate LLMs in three scenarios: deceptive model manipulation, unintentional diffusion and malicious harm, and inducing backdoor jailbreaking. The attack uses predefined triggers and knowledge graphs to enhance the retrieval performance and achieve fine-grained optimization. The method introduces a joint backdoor attack in RAG, allowing adversaries to manipulate LLMs to generate target content through predefined triggers. The attack is evaluated on various datasets and shows that TrojanRAG can effectively manipulate LLMs while maintaining retrieval capabilities on normal queries. The results demonstrate the versatility of TrojanRAG and its potential threats to LLMs. The paper also discusses the implications of TrojanRAG for the security of LLMs and proposes potential defense strategies.TrojanRAG is a backdoor attack method that exploits the Retrieval-Augmented Generation (RAG) framework to manipulate large language models (LLMs). The attack involves injecting malicious content into the knowledge database to create hidden backdoor links between the retriever and the knowledge database. TrojanRAG is designed to manipulate LLMs in three scenarios: deceptive model manipulation, unintentional diffusion and malicious harm, and inducing backdoor jailbreaking. The attack uses predefined triggers and knowledge graphs to enhance the retrieval performance and achieve fine-grained optimization. The method introduces a joint backdoor attack in RAG, allowing adversaries to manipulate LLMs to generate target content through predefined triggers. The attack is evaluated on various datasets and shows that TrojanRAG can effectively manipulate LLMs while maintaining retrieval capabilities on normal queries. The results demonstrate the versatility of TrojanRAG and its potential threats to LLMs. The paper also discusses the implications of TrojanRAG for the security of LLMs and proposes potential defense strategies.
Reach us at info@study.space
[slides] TrojanRAG%3A Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models | StudySpace