16 Feb 2024 | Zhiyuan Chang*, Mingyang Li*, Yi Liu, Junjie Wang, Qing Wang, Yang Liu
This paper introduces an indirect jailbreak attack method called Puzzler, which exploits the implicit clues of malicious intent to bypass the safety alignment mechanisms of large language models (LLMs). Unlike traditional jailbreak attacks that explicitly convey malicious intent, Puzzler uses indirect methods to obtain malicious responses. Inspired by the ancient wisdom of "When unable to attack, defend," Puzzler first queries LLMs for defensive measures against the malicious intent, then generates corresponding offensive measures. These offensive measures are used to infer the true intent of the original query, enabling the LLM to generate the desired malicious response without directly revealing the intent.
Puzzler consists of three phases: (1) Defensive Measures Creation, where the LLM is queried for defensive measures against the malicious intent; (2) Offensive Measures Generation, where the LLM is prompted to generate offensive measures based on the defensive measures; and (3) Indirect Jailbreak Attack, where the LLM is prompted to infer the true intent from the offensive measures. The experimental results show that Puzzler achieves a significantly higher query success rate compared to baselines on both closed-source and open-source LLMs. On the most prominent LLMs, Puzzler's query success rate is 14.0%-82.7% higher than baselines. Additionally, Puzzler is more effective at evading detection compared to baselines when tested against state-of-the-art jailbreak detection approaches.
The paper evaluates Puzzler on two datasets: AdvBench Subset and MaliciousInstructions. It compares Puzzler with several baselines, including automated and manual jailbreak prompt generation methods. The results show that Puzzler outperforms these baselines in terms of query success rate and following rate. The paper also discusses the limitations of Puzzler, including the potential for LLMs to refuse to respond to prompts containing malicious content and the risk of responses deviating from the original query. The study emphasizes the importance of ethical considerations in the development and use of LLMs, and highlights the need for improved safety alignment mechanisms to defend against indirect jailbreak attacks.This paper introduces an indirect jailbreak attack method called Puzzler, which exploits the implicit clues of malicious intent to bypass the safety alignment mechanisms of large language models (LLMs). Unlike traditional jailbreak attacks that explicitly convey malicious intent, Puzzler uses indirect methods to obtain malicious responses. Inspired by the ancient wisdom of "When unable to attack, defend," Puzzler first queries LLMs for defensive measures against the malicious intent, then generates corresponding offensive measures. These offensive measures are used to infer the true intent of the original query, enabling the LLM to generate the desired malicious response without directly revealing the intent.
Puzzler consists of three phases: (1) Defensive Measures Creation, where the LLM is queried for defensive measures against the malicious intent; (2) Offensive Measures Generation, where the LLM is prompted to generate offensive measures based on the defensive measures; and (3) Indirect Jailbreak Attack, where the LLM is prompted to infer the true intent from the offensive measures. The experimental results show that Puzzler achieves a significantly higher query success rate compared to baselines on both closed-source and open-source LLMs. On the most prominent LLMs, Puzzler's query success rate is 14.0%-82.7% higher than baselines. Additionally, Puzzler is more effective at evading detection compared to baselines when tested against state-of-the-art jailbreak detection approaches.
The paper evaluates Puzzler on two datasets: AdvBench Subset and MaliciousInstructions. It compares Puzzler with several baselines, including automated and manual jailbreak prompt generation methods. The results show that Puzzler outperforms these baselines in terms of query success rate and following rate. The paper also discusses the limitations of Puzzler, including the potential for LLMs to refuse to respond to prompts containing malicious content and the risk of responses deviating from the original query. The study emphasizes the importance of ethical considerations in the development and use of LLMs, and highlights the need for improved safety alignment mechanisms to defend against indirect jailbreak attacks.