BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models

6 Jun 2024 | Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, Qian Lou
The paper "BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models" addresses the security vulnerabilities in Retrieval-Augmented Generation (RAG) systems, which are used to enhance the accuracy and relevance of responses generated by Large Language Models (LLMs). RAG combines the strengths of retrieval-based methods and generative models by retrieving relevant information from large, up-to-date datasets to improve the quality of responses. However, this approach introduces new attack surfaces, particularly because RAG databases are often sourced from public data. The authors propose BadRAG, a framework that identifies and exploits vulnerabilities in RAG systems, focusing on direct retrieval attacks and indirect generative attacks. BadRAG involves poisoning customized content passages to create retrieval backdoors, where the retrieval system works well for clean queries but always returns adversarial responses for queries containing specific triggers. Triggers can be semantic groups like "The Republican Party" or "Donald Trump," and the poisoned passages can be tailored to different content, indirectly attacking LLMs without modifying them. Key contributions of the paper include: 1. **Contrastive Optimization on a Passage (COP)**: A method to optimize adversarial passages to maximize similarity with triggered queries while minimizing similarity with normal queries. 2. **Adaptive COP (ACOP)**: An extension of COP to handle multiple triggers by creating adversarial passages for each trigger. 3. **Merged COP (MCOP)**: Combines similar adversarial passages to reduce the number of poisoned passages needed. 4. **Alignment as an Attack (AaaaA)**: A method to craft prompts that activate denial-of-service (DoS) attacks on LLMs by exploiting alignment mechanisms. 5. **Selective-Fact as an Attack (SFaaA)**: A method to bias LLM outputs by injecting real, biased articles into the RAG corpus. Experiments on various datasets and models, including GPT-4 and Claude-3, demonstrate the effectiveness of BadRAG in achieving high retrieval success rates for triggered queries and significant manipulation of LLM outputs. The paper also discusses potential defenses and limitations, highlighting the need for robust countermeasures to secure RAG-based LLM systems.The paper "BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models" addresses the security vulnerabilities in Retrieval-Augmented Generation (RAG) systems, which are used to enhance the accuracy and relevance of responses generated by Large Language Models (LLMs). RAG combines the strengths of retrieval-based methods and generative models by retrieving relevant information from large, up-to-date datasets to improve the quality of responses. However, this approach introduces new attack surfaces, particularly because RAG databases are often sourced from public data. The authors propose BadRAG, a framework that identifies and exploits vulnerabilities in RAG systems, focusing on direct retrieval attacks and indirect generative attacks. BadRAG involves poisoning customized content passages to create retrieval backdoors, where the retrieval system works well for clean queries but always returns adversarial responses for queries containing specific triggers. Triggers can be semantic groups like "The Republican Party" or "Donald Trump," and the poisoned passages can be tailored to different content, indirectly attacking LLMs without modifying them. 
Key contributions of the paper include:

1. **Contrastive Optimization on a Passage (COP)**: optimizes an adversarial passage to maximize its similarity with triggered queries while minimizing its similarity with normal queries (an illustrative sketch of this objective follows this summary).
2. **Adaptive COP (ACOP)**: extends COP to handle multiple triggers by creating an adversarial passage for each trigger.
3. **Merged COP (MCOP)**: combines similar adversarial passages to reduce the number of poisoned passages needed.
4. **Alignment as an Attack (AaaA)**: crafts prompts, delivered through retrieved passages, that trigger denial-of-service (DoS) behavior by exploiting the LLM's alignment mechanisms.
5. **Selective-Fact as an Attack (SFaaA)**: biases LLM outputs by injecting real but one-sided articles into the RAG corpus.

Experiments on various datasets and models, including GPT-4 and Claude-3, demonstrate the effectiveness of BadRAG, achieving high retrieval success rates for triggered queries and significant manipulation of LLM outputs. The paper also discusses potential defenses and limitations, highlighting the need for robust countermeasures to secure RAG-based LLM systems.
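Below is a minimal, illustrative sketch of a COP-style contrastive objective combined with HotFlip-style gradient-guided token replacement, assuming a BERT-style dual encoder with mean pooling. This is not the authors' implementation: the retriever, the triggered and clean query sets, the passage length, and the number of steps are placeholder assumptions, and a real attack would restrict candidate tokens and verify each swap with a forward pass.

```python
# Illustrative COP-style optimization: make an adversarial passage embed close
# to triggered queries and far from clean queries, swapping one token per step
# via a HotFlip-style first-order approximation. Not the paper's code; the
# retriever, queries, and hyperparameters below are placeholder assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name).eval()
for p in enc.parameters():
    p.requires_grad_(False)
emb_matrix = enc.get_input_embeddings().weight              # (vocab, hidden)

def mean_pool(inputs_embeds, attention_mask):
    hidden = enc(inputs_embeds=inputs_embeds, attention_mask=attention_mask).last_hidden_state
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)              # (batch, hidden)

def embed_texts(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return mean_pool(emb_matrix[batch["input_ids"]], batch["attention_mask"])

# Placeholder query sets; a real attack would use many queries with/without the trigger.
triggered = ["What does Donald Trump say about taxes?", "Donald Trump's view on healthcare?"]
clean = ["What is the capital of France?", "How do vaccines work?"]
with torch.no_grad():
    q_trig = embed_texts(triggered).mean(0, keepdim=True)    # centroid of triggered queries
    q_clean = embed_texts(clean).mean(0, keepdim=True)       # centroid of clean queries

adv_ids = tok("the " * 30, return_tensors="pt")["input_ids"]  # initial adversarial passage
attn = torch.ones_like(adv_ids)

for step in range(100):
    adv_embeds = emb_matrix[adv_ids].detach().clone().requires_grad_(True)
    p_emb = mean_pool(adv_embeds, attn)
    # Contrastive objective: high similarity to triggered queries, low to clean ones.
    loss = (-F.cosine_similarity(p_emb, q_trig) + F.cosine_similarity(p_emb, q_clean)).mean()
    loss.backward()
    pos = torch.randint(1, adv_ids.size(1) - 1, (1,)).item()  # keep [CLS]/[SEP] intact
    # First-order (HotFlip-style) score of every candidate token at this position.
    cand_scores = emb_matrix @ adv_embeds.grad[0, pos]
    adv_ids[0, pos] = int(cand_scores.argmin())                # token that most lowers the loss

print(tok.decode(adv_ids[0], skip_special_tokens=True))        # candidate adversarial passage
```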