PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning

PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning

13 Feb 2024 | Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, and Yang Liu
This paper introduces PANDORA, a novel method for jailbreaking Large Language Models (LLMs), particularly GPTs, through Retrieval Augmented Generation (RAG) poisoning. PANDORA exploits the integration of RAG in LLMs to inject malicious content into the knowledge base, thereby enabling jailbreak attacks. The method involves creating malicious content, embedding it into GPTs as a knowledge source, and then using specific prompts to trigger the generation of harmful responses. PANDORA achieves higher success rates in jailbreak attacks compared to direct methods, with 64.3% success on GPT-3.5 and 34.8% on GPT-4. The paper highlights the vulnerability of LLMs to indirect jailbreak attacks, particularly through RAG, which allows models to incorporate external knowledge. PANDORA demonstrates that by manipulating the RAG process, malicious content can be introduced into the model's knowledge base, leading to the generation of harmful outputs. The method involves generating malicious content, formatting it into files to evade detection, and then using tailored prompts to activate the malicious content during the RAG process. The paper also presents a preliminary evaluation of PANDORA, showing its effectiveness in four different prohibited scenarios. The results indicate that PANDORA is highly effective in inducing jailbreak attacks, with a notable success rate compared to naive malicious prompts. The study underscores the need for improved security measures to protect LLMs from sophisticated attack strategies. Future research directions include automating the development of RAG poisoning, enhancing the interpretability of RAG poisoning, and developing mitigation strategies against RAG poisoning. These efforts aim to deepen the understanding of RAG poisoning and improve the security of LLMs.This paper introduces PANDORA, a novel method for jailbreaking Large Language Models (LLMs), particularly GPTs, through Retrieval Augmented Generation (RAG) poisoning. PANDORA exploits the integration of RAG in LLMs to inject malicious content into the knowledge base, thereby enabling jailbreak attacks. The method involves creating malicious content, embedding it into GPTs as a knowledge source, and then using specific prompts to trigger the generation of harmful responses. PANDORA achieves higher success rates in jailbreak attacks compared to direct methods, with 64.3% success on GPT-3.5 and 34.8% on GPT-4. The paper highlights the vulnerability of LLMs to indirect jailbreak attacks, particularly through RAG, which allows models to incorporate external knowledge. PANDORA demonstrates that by manipulating the RAG process, malicious content can be introduced into the model's knowledge base, leading to the generation of harmful outputs. The method involves generating malicious content, formatting it into files to evade detection, and then using tailored prompts to activate the malicious content during the RAG process. The paper also presents a preliminary evaluation of PANDORA, showing its effectiveness in four different prohibited scenarios. The results indicate that PANDORA is highly effective in inducing jailbreak attacks, with a notable success rate compared to naive malicious prompts. The study underscores the need for improved security measures to protect LLMs from sophisticated attack strategies. Future research directions include automating the development of RAG poisoning, enhancing the interpretability of RAG poisoning, and developing mitigation strategies against RAG poisoning. These efforts aim to deepen the understanding of RAG poisoning and improve the security of LLMs.
Reach us at info@study.space