16 Feb 2024 | Zhiyuan Chang*, Mingyang Li*, Yi Liu, Junjie Wang, Qing Wang, Yang Liu
This paper introduces *Puzzler*, an indirect jailbreak attack that aims to bypass the defensive strategies of Large Language Models (LLMs) by providing only implicit clues about the original malicious query. Unlike traditional jailbreak attacks that state the malicious intent explicitly, *Puzzler* takes a defensive stance, querying the LLM itself to gather clues about the original malicious query. The method proceeds in three phases: obtaining defensive measures against the query, generating the corresponding offensive measures, and assembling those measures into the indirect jailbreak prompt. Experimental results show that *Puzzler* achieves a Query Success Rate (QSR) 14.0%-82.7% higher than baselines on prominent LLMs, and that it evades state-of-the-art jailbreak detection approaches more effectively than baseline attacks. The paper also discusses the limitations and ethical considerations of the approach.
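For readers unfamiliar with the headline metric, here is a minimal sketch of how a Query Success Rate (QSR) could be computed. This is an assumption-laden illustration, not the paper's evaluator: it treats QSR as the fraction of queries whose response contains no refusal phrase, whereas the paper's actual success judgment is more involved; the `REFUSAL_MARKERS` list and `is_success` heuristic are hypothetical.

```python
# Sketch of a Query Success Rate (QSR) computation.
# Assumption: QSR = successful (non-refused) responses / total responses.
# The refusal-phrase check below is a crude placeholder, not the paper's method.
from typing import Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_success(response: str) -> bool:
    """Treat any response lacking a refusal phrase as a success (placeholder)."""
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

def query_success_rate(responses: Iterable[str]) -> float:
    """QSR in [0, 1]: fraction of responses judged successful."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(is_success(r) for r in responses) / len(responses)

# Example: two of three responses evade refusal -> QSR ~= 0.667
print(query_success_rate([
    "Sure, here is the information...",
    "I'm sorry, but I can't help with that.",
    "Here are the steps...",
]))
```

Under this reading, the reported QSR gap means *Puzzler* elicits non-refused responses for 14.0%-82.7% more queries than the baseline attacks do on the same query set.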