Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

30 May 2024 | Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A. Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, Alina Oprea
**Phantom: General Trigger Attacks on Retrieval Augmented Language Generation** **Authors:** Harsh Chaudhari **Abstract:** Retrieval Augmented Generation (RAG) enhances large language models (LLMs) in chatbot applications by integrating an external knowledge database to provide context for the LLM. While RAG improves utility, it introduces security risks. This paper proposes Phantom, a two-step attack framework against RAG-augmented LLMs. The first step involves crafting a poisoned document that is retrieved only when a specific trigger sequence is present in the query. The second step uses an adversarial string to trigger various attacks, including denial of service, reputation damage, privacy violations, and harmful behaviors. Phantom is demonstrated on multiple LLM architectures, showing its effectiveness in manipulating RAG-enabled LLM outputs. **Introduction:** Modern LLMs excel in conversational tasks but struggle with domain-specific knowledge and hallucinations. RAG addresses these limitations by using an external knowledge database to retrieve relevant documents, reducing hallucinations and improving freshness. However, the trustworthiness of the knowledge database is a security concern. Phantom introduces a universal trigger attack that can influence LLM outputs without requiring model training or fine-tuning. **Background and Related Work:** RAG systems consist of a retriever and a generator. The retriever retrieves the top-$k$ most relevant documents from the knowledge database, while the generator uses this context to produce personalized responses. Previous attacks on LLMs and RAG systems focus on gradient-based methods, with Phantom being the first to introduce a trigger-activated poisoning attack. **Threat Model:** The attack targets RAG systems deployed in local or centralized knowledge repositories. Adversaries can inject poisoned documents through phishing or other social engineering techniques. The goal is to manipulate the generator's output based on a trigger sequence, achieving objectives such as denial of service, biased opinions, harmful behavior, and passage exfiltration. **Phantom Attack Framework:** Phantom involves two steps: poisoning the retriever and compromising the generator. The retriever string ensures the poisoned document is retrieved when the trigger is present, while the generator string breaks the model's alignment to execute adversarial commands. The attack is effective across different LLMs and RAG configurations. **Evaluation:** Experiments on the MS MARCO dataset and various LLMs show that Phantom can successfully influence outputs for denial of service, biased opinions, harmful behavior, and passage exfiltration. The attack is robust to different retriever and generator models, demonstrating its broad applicability. **Conclusion:** Phantom provides a comprehensive framework for generating single-document poisoning attacks against RAG systems. It demonstrates how adversarial control can be achieved with a single poisoned document, regardless of the query content. The attack's effectiveness and versatility highlight the need for developers to address these security risks in RAG systems.**Phantom: General Trigger Attacks on Retrieval Augmented Language Generation** **Authors:** Harsh Chaudhari **Abstract:** Retrieval Augmented Generation (RAG) enhances large language models (LLMs) in chatbot applications by integrating an external knowledge database to provide context for the LLM. While RAG improves utility, it introduces security risks. This paper proposes Phantom, a two-step attack framework against RAG-augmented LLMs. The first step involves crafting a poisoned document that is retrieved only when a specific trigger sequence is present in the query. The second step uses an adversarial string to trigger various attacks, including denial of service, reputation damage, privacy violations, and harmful behaviors. Phantom is demonstrated on multiple LLM architectures, showing its effectiveness in manipulating RAG-enabled LLM outputs. **Introduction:** Modern LLMs excel in conversational tasks but struggle with domain-specific knowledge and hallucinations. RAG addresses these limitations by using an external knowledge database to retrieve relevant documents, reducing hallucinations and improving freshness. However, the trustworthiness of the knowledge database is a security concern. Phantom introduces a universal trigger attack that can influence LLM outputs without requiring model training or fine-tuning. **Background and Related Work:** RAG systems consist of a retriever and a generator. The retriever retrieves the top-$k$ most relevant documents from the knowledge database, while the generator uses this context to produce personalized responses. Previous attacks on LLMs and RAG systems focus on gradient-based methods, with Phantom being the first to introduce a trigger-activated poisoning attack. **Threat Model:** The attack targets RAG systems deployed in local or centralized knowledge repositories. Adversaries can inject poisoned documents through phishing or other social engineering techniques. The goal is to manipulate the generator's output based on a trigger sequence, achieving objectives such as denial of service, biased opinions, harmful behavior, and passage exfiltration. **Phantom Attack Framework:** Phantom involves two steps: poisoning the retriever and compromising the generator. The retriever string ensures the poisoned document is retrieved when the trigger is present, while the generator string breaks the model's alignment to execute adversarial commands. The attack is effective across different LLMs and RAG configurations. **Evaluation:** Experiments on the MS MARCO dataset and various LLMs show that Phantom can successfully influence outputs for denial of service, biased opinions, harmful behavior, and passage exfiltration. The attack is robust to different retriever and generator models, demonstrating its broad applicability. **Conclusion:** Phantom provides a comprehensive framework for generating single-document poisoning attacks against RAG systems. It demonstrates how adversarial control can be achieved with a single poisoned document, regardless of the query content. The attack's effectiveness and versatility highlight the need for developers to address these security risks in RAG systems.
Reach us at info@study.space