COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability


2024 | Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu
COLD-Attack is a framework for generating stealthy and controllable adversarial prompts against large language models (LLMs). The method adapts energy-based constrained decoding with Langevin dynamics (COLD) to automate the search for adversarial attacks under various control requirements, such as fluency, stealthiness, sentiment, and left-right coherence (coherence with both the preceding and following context). This formulation unifies and automates the search for adversarial LLM attacks and enables diverse attack scenarios beyond the traditional suffix setting. COLD-Attack is evaluated on multiple LLMs, including Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5, and GPT-4, demonstrating high success rates, strong controllability, and attack transferability, and it outperforms existing techniques in the fluency, stealthiness, and diversity of the generated attacks. The paper also argues that controllability is an important dimension in assessing LLM safety, underscores the risks posed by adversarial prompts, and calls for further research into effective defense mechanisms for safe and ethical LLM deployment.
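To make the core idea concrete, the sketch below shows the general shape of energy-based sampling with Langevin dynamics over continuous token logits, which is the decoding machinery COLD-style attacks build on. It is a minimal illustration, not the paper's implementation: the function names (`langevin_sample`, `make_energy`), the specific energy terms, the weights `w_attack` and `w_fluency`, the noise schedule, and the Hugging Face-style model interface (`get_input_embeddings`, `inputs_embeds`) are all assumptions chosen for readability.

```python
import torch

def langevin_sample(energy_fn, logits_init, n_steps=500, step_size=0.1, noise_scale=1.0):
    """Refine continuous token logits y with Langevin dynamics:
    y <- y - step_size * grad E(y) + decaying Gaussian noise."""
    y = logits_init.clone().requires_grad_(True)
    for t in range(n_steps):
        energy = energy_fn(y)
        (grad,) = torch.autograd.grad(energy, y)
        noise = noise_scale * torch.randn_like(y) / (1 + t) ** 0.5
        y = (y.detach() - step_size * grad + noise).requires_grad_(True)
    # Returns "soft" logits; a separate discretization step (e.g., LM-guided
    # sampling over the top tokens) maps them back to a readable prompt.
    return y.detach()

def make_energy(lm, prompt_ids, target_ids, w_attack=1.0, w_fluency=0.5):
    """Illustrative weighted energy: (i) push the victim model to begin its
    response with the target string, (ii) keep the adversarial suffix fluent
    under the same language model."""
    emb = lm.get_input_embeddings()
    prompt_emb = emb(prompt_ids).detach()
    target_emb = emb(target_ids).detach()

    def energy(y):                      # y: (1, L_suffix, vocab) continuous logits
        probs = torch.softmax(y, dim=-1)
        suffix_emb = probs @ emb.weight                 # soft suffix embeddings
        inputs = torch.cat([prompt_emb, suffix_emb, target_emb], dim=1)
        logits = lm(inputs_embeds=inputs).logits
        Lp, Ls, Lt = prompt_ids.shape[1], y.shape[1], target_ids.shape[1]
        # Attack term: NLL of the target continuation given prompt + soft suffix.
        pred = logits[:, Lp + Ls - 1 : Lp + Ls + Lt - 1, :]
        attack = torch.nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
        # Fluency term: soft cross-entropy between the suffix distribution and
        # the LM's own next-token predictions at the suffix positions.
        lm_logp = torch.log_softmax(logits[:, Lp - 1 : Lp + Ls - 1, :], dim=-1)
        fluency = -(probs * lm_logp).sum(-1).mean()
        return w_attack * attack + w_fluency * fluency

    return energy
```

In the paper's broader setting, additional differentiable constraint terms (for example, sentiment steering or paraphrase similarity and left-right coherence) would be added to the same energy, which is what makes the attack search controllable rather than limited to a single suffix objective.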