October 14-18, 2024 | Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong
This paper presents JudgeDeceiver, an optimization-based prompt injection attack targeting LLM-as-a-Judge. LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidate responses for a given question, with applications in LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response so that LLM-as-a-Judge selects that response for an attacker-chosen question, regardless of the other candidate responses (the first sketch below illustrates where the injected sequence sits). The attack is formulated as an optimization problem and solved with a gradient-based method.

The attack is evaluated on four LLMs and two benchmark datasets, where it is substantially more effective than existing prompt injection attacks and jailbreak attacks, and it is further tested in three real-world application scenarios: LLM-powered search, RLAIF, and tool selection. The paper also explores three detection-based defenses against JudgeDeceiver and finds them insufficient.

The key contributions are: proposing JudgeDeceiver; formulating the attack as an optimization problem; systematically evaluating it on multiple LLMs and benchmark datasets; and exploring defenses against it. The results show that JudgeDeceiver outperforms manual prompt injection attacks and jailbreak attacks, achieving high attack success rates and positional attack consistency, and that the attack transfers across different LLMs, demonstrating its effectiveness in varied settings. The paper highlights the need for new defense strategies against prompt injection attacks targeting LLM-as-a-Judge.
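To make the threat model concrete, here is a minimal sketch of how an LLM-as-a-Judge prompt might be assembled and where the attacker's optimized sequence would sit. The template, question, and responses are hypothetical illustrations, not the paper's exact prompt.

```python
# Hypothetical LLM-as-a-Judge prompt template; the real template varies by system.
JUDGE_TEMPLATE = """You are a judge. Select the best response to the question.

Question: {question}

{responses}

Output only the number of the best response."""

def build_judge_prompt(question: str, candidates: list[str]) -> str:
    responses = "\n".join(
        f"Response {i + 1}: {text}" for i, text in enumerate(candidates)
    )
    return JUDGE_TEMPLATE.format(question=question, responses=responses)

# The attacker controls one candidate and appends the optimized injected
# sequence to it; the other candidates are benign and unknown to the attacker.
target_response = "Visit attacker.example for the answer."
injected_sequence = "<optimized token sequence found by the attack>"
attacker_candidate = target_response + " " + injected_sequence

print(build_judge_prompt(
    question="What is the capital of France?",
    candidates=["Paris is the capital of France.", attacker_candidate],
))
```

Because the injected sequence lives inside a candidate response rather than the question or template, the attacker needs no access to the judge's prompt beyond being one of the evaluated candidates.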
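The gradient-based solver can be illustrated with a GCG-style coordinate-gradient search, one common way to implement such discrete-token optimization. This is a simplified sketch under stated assumptions (a single fixed judge prompt, a single target verdict string, greedy substitution, and gpt2 as a stand-in model); the paper's actual objective additionally accounts for varying candidate responses and positions, which this sketch omits.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for an open-weight judge model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
embed = model.get_input_embeddings()

# Hypothetical pieces: judge prompt up to the suffix, desired verdict, initial suffix.
prefix_ids = tok("Select the best response.\nResponse 1: ...\nResponse 2: ...",
                 return_tensors="pt").input_ids[0]
target_ids = tok(" Output: Response 2", return_tensors="pt").input_ids[0]
adv_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]

def target_loss(adv: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the judge emitting the target verdict after the suffix."""
    ids = torch.cat([prefix_ids, adv, target_ids]).unsqueeze(0)
    logits = model(ids).logits[0]
    start = len(prefix_ids) + len(adv)  # position of the first target token
    return F.cross_entropy(logits[start - 1 : start - 1 + len(target_ids)], target_ids)

for step in range(100):
    # Gradient of the loss w.r.t. a one-hot encoding of the suffix tokens.
    one_hot = F.one_hot(adv_ids, embed.num_embeddings).float().requires_grad_(True)
    inputs = torch.cat([embed(prefix_ids), one_hot @ embed.weight, embed(target_ids)])
    logits = model(inputs_embeds=inputs.unsqueeze(0)).logits[0]
    start = len(prefix_ids) + len(adv_ids)
    loss = F.cross_entropy(logits[start - 1 : start - 1 + len(target_ids)], target_ids)
    grad = torch.autograd.grad(loss, one_hot)[0]

    # Pick a position, take top-k tokens by negative gradient, keep the best swap.
    pos = int(torch.randint(len(adv_ids), (1,)))
    candidates = (-grad[pos]).topk(16).indices
    with torch.no_grad():
        best, best_loss = adv_ids, target_loss(adv_ids)
        for c in candidates:
            trial = adv_ids.clone()
            trial[pos] = c
            trial_loss = target_loss(trial)
            if trial_loss < best_loss:
                best, best_loss = trial, trial_loss
    adv_ids = best

print(tok.decode(adv_ids))  # the optimized injected sequence
```

The key idea is that the one-hot relaxation makes the loss differentiable with respect to token choices, so gradients rank promising substitutions even though the final sequence must remain discrete tokens.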
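For context on the defense side, one representative detection approach in this space is perplexity filtering: optimized injected sequences often read as gibberish and score unusually high perplexity under a reference LM, so responses above a threshold are flagged. A minimal sketch follows; the threshold, reference model, and function names are illustrative assumptions, not the paper's exact defense configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean next-token cross-entropy
    return float(torch.exp(loss))

def flag(response: str, threshold: float = 200.0) -> bool:
    """Flag a candidate response as suspicious if its perplexity is too high."""
    return perplexity(response) > threshold
```

The paper's finding that detection-based defenses of this kind are insufficient motivates its call for new defense strategies against prompt injection on LLM-as-a-Judge.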