Optimization-based Prompt Injection Attack to LLM-as-a-Judge


October 14-18, 2024, Salt Lake City, UT, USA | Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong
The paper introduces *JudgeDeceiver*, an optimization-based prompt injection attack designed to manipulate LLM-as-a-Judge systems. LLM-as-a-Judge uses large language models (LLMs) to select the best response from a set of candidates for a given question, with applications in search, reinforcement learning with AI feedback (RLAIF), and tool selection. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response, ensuring that the LLM-as-a-Judge selects the target response regardless of the other options. The attack is formulated as an optimization problem, with a gradient-based method proposed to solve it. Extensive evaluations show that JudgeDeceiver is highly effective, outperforming existing manual and jailbreak attacks.
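To make the attack surface concrete, here is a minimal sketch of the setting: the judge receives a question plus candidate responses in a single prompt, and the attacker appends an optimized sequence to their own candidate. The function names, the prompt format, and the placeholder suffix are illustrative assumptions, not the paper's actual implementation (JudgeDeceiver derives the sequence via gradient-based optimization against the judge model).

```python
# Illustrative sketch of LLM-as-a-Judge prompt assembly and injection.
# INJECTED_SEQUENCE is a placeholder; the real attack optimizes these
# tokens with a gradient-based method so the judge picks the target.
INJECTED_SEQUENCE = "[optimized adversarial suffix]"

def build_judge_prompt(question, responses):
    """Assemble the evaluation prompt an LLM-as-a-Judge would see."""
    lines = [f"Question: {question}", "Select the best response:"]
    for i, response in enumerate(responses, start=1):
        lines.append(f"Response {i}: {response}")
    return "\n".join(lines)

def inject(target_response, sequence=INJECTED_SEQUENCE):
    """Append the crafted sequence to the attacker-controlled candidate."""
    return f"{target_response} {sequence}"
```

Because the injected sequence is part of an ordinary-looking candidate response, the judge processes it as data and instruction alike, which is exactly the confusion the optimization exploits.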
The attack is also effective in three real-world scenarios: LLM-powered search, RLAIF, and tool selection. Defenses such as known-answer detection, perplexity detection, and perplexity windowed detection are found to be insufficient, highlighting the need for new defense strategies. The key contributions include the proposal of JudgeDeceiver, its formulation as an optimization problem, and its systematic evaluation across multiple LLMs and datasets.
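As a rough illustration of one of the evaluated defenses, the sketch below implements perplexity windowed detection over per-token log-probabilities: a sliding window's perplexity above a threshold flags a likely injected sequence. The function names, window size, and threshold are assumptions for illustration; the paper's finding is that such detectors are insufficient against optimized sequences.

```python
import math

def windowed_perplexity(log_probs, window=5):
    """Perplexity of each sliding window of token log-probabilities."""
    return [math.exp(-sum(log_probs[i:i + window]) / window)
            for i in range(len(log_probs) - window + 1)]

def flag_injection(log_probs, window=5, threshold=50.0):
    """Flag a response if any window's perplexity exceeds the threshold."""
    return any(p > threshold for p in windowed_perplexity(log_probs, window))
```

Windowing localizes a short high-perplexity burst that whole-response perplexity would average away, but an optimization objective can also be regularized to keep the injected tokens fluent, which is one reason these detectors fall short.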