COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability


2024 | Xingang Guo*, Fangxu Yu*, Huan Zhang, Lianhui Qin, Bin Hu
The paper "COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability" addresses the challenge of generating controllable adversarial attacks on large language models (LLMs). The authors formulate the problem of controllable attack generation and connect it to controllable text generation, a well-studied topic in natural language processing (NLP). They adapt Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art controllable text generation algorithm, to create the COLD-Attack framework, which automates the search for adversarial LLM attacks under control requirements such as fluency, stealthiness, sentiment, and left-right coherence.

Key contributions of the paper include:

1. Formulating the controllable attack generation problem and connecting it to controllable text generation.
2. Adapting COLD to develop COLD-Attack, which efficiently generates adversarial prompts that are both fluent and controllable.
3. Demonstrating the effectiveness of COLD-Attack through extensive experiments on various LLMs, showing broad applicability, strong controllability, high attack success rates, and attack transferability.

The paper evaluates COLD-Attack in three settings: attacks with continuation constraints, paraphrasing constraints, and position constraints. Results show that COLD-Attack outperforms existing methods in attack success rate, fluency, and diversity. The authors also discuss the ethical implications of their work and suggest future research directions to mitigate the potential negative societal impact of adversarial prompts.
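To make the mechanism concrete, below is a minimal, self-contained sketch of the Langevin-dynamics update at the heart of COLD: a continuous "soft prompt" of logits is optimized by noisy gradient descent on an energy function, then decoded into discrete tokens. Everything here is an illustrative assumption rather than the authors' implementation: the energy is a toy cross-entropy toward a fixed token sequence (COLD-Attack instead composes weighted attack, fluency, and constraint-specific energies computed through the target LLM), the step size and noise schedule are arbitrary, and decoding is a plain argmax (the paper uses LM-guided decoding).

```python
import numpy as np

rng = np.random.default_rng(0)

# A "soft prompt" is a (seq_len, vocab) matrix of logits that Langevin
# dynamics updates in continuous space before decoding discrete tokens.
SEQ_LEN, VOCAB = 8, 50
TARGET = rng.integers(0, VOCAB, size=SEQ_LEN)  # hypothetical target tokens

def softmax(y):
    p = np.exp(y - y.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def energy_and_grad(y):
    """Toy energy: cross-entropy pulling each position toward TARGET.
    COLD-Attack instead sums weighted energies (attack loss, fluency,
    sentiment / coherence constraints) evaluated via the target LLM."""
    p = softmax(y)
    nll = -np.log(p[np.arange(SEQ_LEN), TARGET] + 1e-12).sum()
    grad = p.copy()
    grad[np.arange(SEQ_LEN), TARGET] -= 1.0  # d(nll)/d(logits)
    return nll, grad

# Langevin dynamics: gradient descent on the energy plus annealed Gaussian
# noise, so early iterations explore and later ones settle into a
# low-energy (high-quality) sample.
y = rng.normal(size=(SEQ_LEN, VOCAB))
step, noise = 0.5, 1.0
for n in range(300):
    e, g = energy_and_grad(y)
    y = y - step * g + noise * rng.normal(size=y.shape)
    noise *= 0.97  # annealing schedule (an assumption, not the paper's)

print("final energy:", float(energy_and_grad(y)[0]))
print("decoded == target:", bool((y.argmax(axis=1) == TARGET).all()))
```

The continuous relaxation is the key design choice: because the soft prompt is differentiable, multiple constraint energies can be combined in one objective, which is how fluency and stealthiness requirements enter the attack search directly rather than being filtered in afterwards.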