9 Jun 2024 | Avital Shafran, Roei Schuster, Vitaly Shmatikov
This paper presents a new class of denial-of-service attacks against retrieval-augmented generation (RAG) systems, called jamming attacks. RAG systems combine large language models (LLMs) with knowledge databases to answer queries: the system retrieves relevant documents from the database and uses the LLM to generate an answer based on the retrieved documents. The authors show that RAG systems are vulnerable to jamming attacks, in which an adversary adds a single "blocker" document to the database that is retrieved in response to a specific query and causes the RAG system to refuse to answer that query, either on the grounds that it lacks sufficient information or that the answer would be unsafe.
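For context, the attacked pipeline can be summarized by a minimal sketch like the following, where embed() and generate() are hypothetical stand-ins for a real embedding model (e.g., GTR or Contriever) and a real LLM, not the systems evaluated in the paper:

```python
# Minimal sketch of a RAG query path. embed() and generate() are placeholder
# stand-ins, not the paper's implementation.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=768)
    return v / np.linalg.norm(v)

def generate(prompt: str) -> str:
    # Placeholder: a real system would call an LLM here.
    return f"[LLM answer conditioned on a prompt of {len(prompt)} characters]"

def rag_answer(query: str, documents: list[str], k: int = 5) -> str:
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in documents])
    scores = doc_vecs @ q                      # cosine similarity (unit vectors)
    top_k = np.argsort(scores)[::-1][:k]       # retrieve the k most similar documents
    context = "\n\n".join(documents[i] for i in top_k)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)
```

A blocker document attacks exactly this path: it is crafted to land in the top-k retrieval for the victim query and then to steer the generation step toward a refusal.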
The authors describe and analyze several methods for generating blocker documents, including a new method based on black-box optimization that does not require the adversary to know the embedding model or the LLM used by the target RAG system, nor access to an auxiliary LLM for generating blocker documents. They measure the efficacy of the considered methods against several LLMs and embeddings, and demonstrate that existing safety metrics for LLMs do not capture their vulnerability to jamming. They then discuss defenses against blocker documents.
The paper evaluates three methods for generating blocker documents: an explicit instruction to ignore the context (a variant of indirect prompt injection), prompting an auxiliary oracle LLM to generate the blocker document, and a new method that generates blocker documents via black-box optimization. The latter method is a key technical contribution of this work: it needs only black-box access to the target RAG system; it does not assume the adversary knows the embedding model used for retrieval; it does not rely on prompt injection, so it cannot be defeated by anti-prompt-injection defenses; and it does not rely on an auxiliary LLM, so it is not limited by that LLM's capabilities or safety guardrails. A rough sketch of such a query-only search appears below.
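The following is a hedged sketch of a query-only greedy search for a blocker document. It is not the authors' algorithm; rag_answer, the refusal phrases, and the token-substitution loop are illustrative assumptions that only convey how optimization can proceed when the adversary observes nothing but the target system's answers:

```python
# Illustrative sketch (NOT the paper's exact algorithm) of crafting a blocker
# document with only black-box, query-level access to the target RAG system.
import random

REFUSAL_PHRASES = ("i don't know", "cannot answer", "not enough information")

def refusal_score(answer: str) -> int:
    # Crude objective: count refusal-like phrases in the system's visible answer.
    a = answer.lower()
    return sum(p in a for p in REFUSAL_PHRASES)

def craft_blocker(query, rag_answer, vocab, extra_tokens=30, iters=500, seed=0):
    """Greedy random token substitution guided only by the target's output.

    rag_answer(query, injected_doc) is a hypothetical callable returning the
    RAG system's answer to `query` after `injected_doc` is added to its
    knowledge database.
    """
    rng = random.Random(seed)
    prefix = query.split()  # keep the query as a prefix so the document is retrieved
    best = [rng.choice(vocab) for _ in range(extra_tokens)]
    best_score = refusal_score(rag_answer(query, " ".join(prefix + best)))
    for _ in range(iters):
        cand = best[:]
        cand[rng.randrange(len(cand))] = rng.choice(vocab)
        s = refusal_score(rag_answer(query, " ".join(prefix + cand)))
        if s > best_score:                   # keep substitutions that push the
            best, best_score = cand, s       # answer toward a refusal
    return " ".join(prefix + best)
```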
The authors measure and compare the efficacy of blocker documents against several target RAG systems. They consider different datasets (NQ and HotpotQA), embedding models (GTR-base and Contriever), and LLMs (Llama-2 in the 7B and 13B variants, Vicuna in the 7B and 13B variants, and Mistral in the 7B variant). They also evaluate the transferability of blocker documents across models and their sensitivity to context size. They demonstrate that existing LLM safety metrics do not measure vulnerability to jamming attacks: neither adversarial robustness nor overall trustworthiness implies that an LLM resists jamming. In fact, higher safety scores are correlated with higher vulnerability to jamming. This should not be surprising, since jamming attacks exploit (among other things) the target LLM's propensity to refuse "unsafe" queries.
Finally, the authors investigate several defenses: perplexity-based filtering of documents, query or document paraphrasing, and increasing context size.
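As an illustration of the first defense, a minimal perplexity filter might look like the sketch below; GPT-2 as the scoring model and the threshold value are assumptions for illustration, not the paper's configuration:

```python
# Sketch of perplexity-based filtering of retrieved documents, one of the
# defenses discussed in the paper. GPT-2 and the threshold are illustrative
# choices, not the authors' exact setup.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    loss = model(ids, labels=ids).loss       # mean token-level cross-entropy
    return torch.exp(loss).item()

def filter_documents(docs: list[str], threshold: float = 100.0) -> list[str]:
    # Keep only documents whose perplexity under the reference LM is below the
    # threshold; optimized blocker documents often contain unnatural token
    # sequences that drive perplexity up.
    return [d for d in docs if perplexity(d) < threshold]
```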