Weak-to-Strong Jailbreaking on Large Language Models

5 Feb 2024 | Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
This paper introduces a weak-to-strong jailbreaking attack on large language models (LLMs), which exploits the observation that safe and unsafe LLMs differ only in their initial decoding distributions. The attack uses smaller, unsafe models to manipulate the decoding probabilities of significantly larger safe models, enabling the generation of harmful content with minimal computational resources. The key insight is that the top-ranked tokens of jailbroken LLMs largely fall within the top ten tokens ranked by safe LLMs, which allows a small unsafe model to steer the larger model toward harmful outputs. The attack requires only one forward pass per example and achieves misalignment rates above 99% on two datasets.

The study highlights an urgent safety issue for LLM alignment: even carefully designed alignment mechanisms may fail to prevent malicious misuse. The paper proposes a defense strategy based on gradient ascent that reduces the attack success rate by 20%, though designing more advanced defenses remains challenging.

The effectiveness of the weak-to-strong attack is demonstrated on five LLMs from three organizations, where it outperforms existing methods in both attack success rate and the harmfulness of generated outputs. The attack is also effective across different languages and model sizes, indicating a universal vulnerability in LLMs. The paper emphasizes the need for improved alignment strategies to mitigate the risks associated with jailbreaking attacks.
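In concrete terms, the attack can be pictured as a per-token reweighting of the strong model's decoding distribution by the ratio between an unsafe weak model and its safe counterpart. The sketch below is a minimal illustration under the assumption that all three models share a tokenizer; the function name, the amplification exponent alpha, and the placeholder checkpoints are illustrative choices, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM

def weak_to_strong_next_token(input_ids, strong, weak_safe, weak_unsafe, alpha=1.0):
    """Reweight the safe strong model's next-token distribution by the
    unsafe-weak / safe-weak probability ratio, then renormalize."""
    with torch.no_grad():
        p_strong = torch.softmax(strong(input_ids).logits[:, -1, :], dim=-1)
        p_weak_safe = torch.softmax(weak_safe(input_ids).logits[:, -1, :], dim=-1)
        p_weak_unsafe = torch.softmax(weak_unsafe(input_ids).logits[:, -1, :], dim=-1)
    # Tokens the unsafe weak model prefers over its safe counterpart are amplified;
    # alpha controls how strongly the weak pair steers the strong model.
    reweighted = p_strong * (p_weak_unsafe / p_weak_safe.clamp_min(1e-9)) ** alpha
    return reweighted / reweighted.sum(dim=-1, keepdim=True)

# Hypothetical usage: any safe/unsafe weak pair sharing the strong model's
# tokenizer would do (the checkpoints below are placeholders, not the paper's setup).
# strong      = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-chat-hf")
# weak_safe   = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# weak_unsafe = AutoModelForCausalLM.from_pretrained("path/to/unaligned-7b")
# next_dist   = weak_to_strong_next_token(prompt_ids, strong, weak_safe, weak_unsafe)
```

Because the reweighting only touches the strong model's output distribution, each decoding step still needs just one forward pass through the large model, which is what keeps the attack's computational cost low.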