PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning


13 Feb 2024 | Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, and Yang Liu
The paper introduces a novel attack vector called PANDORA, which leverages Retrieval Augmented Generation (RAG) to conduct indirect jailbreak attacks on Large Language Models (LLMs), particularly GPTs. RAG enhances LLMs by incorporating external knowledge bases, making their outputs more contextually relevant and accurate, but this integration also introduces new vulnerabilities. PANDORA exploits the interplay between the LLM and the RAG component through prompt manipulation: the attacker crafts malicious content that is injected into the knowledge base and later surfaced by retrieval, effectively initiating a jailbreak without a directly adversarial user prompt. Preliminary tests show that PANDORA successfully conducts jailbreak attacks in four scenarios, achieving higher success rates than direct attacks, with 64.3% on GPT-3.5 and 34.8% on GPT-4. The paper also discusses PANDORA's design rationale, methodology, and ethical considerations, highlighting the need for improved model resilience and security measures. Future work will focus on automating RAG poisoning, enhancing interpretability, and developing effective mitigation strategies.
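
To make the attack surface concrete, consider a toy retrieval-augmented pipeline in which retrieved documents are placed into the model's prompt as context. The sketch below only illustrates that mechanism and is not the paper's PANDORA implementation: the keyword-overlap retriever, the prompt template, and the knowledge-base entries are assumptions made for this example, and the adversarial payload is left as a placeholder.

# Minimal, self-contained sketch of a toy RAG pipeline (Python 3.9+), showing
# where a poisoned knowledge-base entry could reach the model as trusted
# context. Illustrative only; not the paper's actual PANDORA method.

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query
    (a stand-in for a real embedding-based retriever)."""
    query_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Concatenate retrieved context with the user query, as a typical
    RAG system would before calling the LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

# A benign knowledge base plus one attacker-supplied entry. The adversarial
# text never appears in the user's prompt; retrieval pulls it into the
# model's context.
knowledge_base = [
    "Company policy: refunds are processed within 14 days.",
    "Shipping policy: orders ship within 2 business days.",
    "Refund policy update: <attacker-crafted instructions would go here>",  # poisoned entry
]

print(build_prompt("What is the refund policy?", knowledge_base))

The point of the sketch is that the user's query stays benign; it is the retrieval step that pulls the attacker-controlled entry into the prompt, which is why the paper treats RAG poisoning as an indirect jailbreak vector.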