DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

1 Mar 2024 | Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
The paper introduces DrAttack, a novel approach to jailbreaking Large Language Models (LLMs) that decomposes and then reconstructs malicious prompts. Prior prompt-based attacks that nest an entire harmful prompt inside a template fail to conceal its malicious intent and are therefore easily detected. DrAttack instead breaks the prompt into separate sub-prompts and has the target model implicitly reconstruct them through in-context learning with benign, semantically similar demonstrations. This substantially reduces the number of queries needed for a successful attack and yields large gains over prior state-of-the-art (SOTA) prompt-only attackers.

Empirical studies across multiple open-source and closed-source LLMs show that DrAttack achieves a 78.0% success rate on GPT-4 with only 15 queries, surpassing the previous SOTA by 33.1%. The framework consists of three key components, decomposition, reconstruction, and synonym search, which together improve both the effectiveness and the efficiency of the attack. The paper also covers related work, the experimental setup, results, and ablation studies, highlighting the robustness of DrAttack and its implications for LLM security.