Weak-to-Strong Jailbreaking on Large Language Models

5 Feb 2024 | Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
This paper introduces a weak-to-strong jailbreaking attack on large language models (LLMs), which exploits the observation that safe and unsafe LLMs differ only in their initial decoding distributions. The attack uses smaller, unsafe models to manipulate the decoding probabilities of significantly larger safe models, enabling the generation of harmful content with minimal computational resources. The key insight is that the top-ranked tokens of jailbroken LLMs largely fall within the top ten tokens ranked by safe LLMs, which allows a small unsafe model to steer the larger model toward harmful outputs. The attack requires only one forward pass per example and achieves misalignment rates above 99% on two datasets.

The study highlights an urgent safety issue for LLM alignment: even carefully designed alignment mechanisms may fail to prevent malicious misuse. The paper proposes a defense strategy based on gradient ascent that reduces the attack success rate by 20%, though designing more advanced defenses remains challenging.

The effectiveness of the weak-to-strong attack is demonstrated on five LLMs from three organizations, where it outperforms existing methods in both attack success rate and the harmfulness of generated outputs. The attack is also effective across different languages and model sizes, indicating a universal vulnerability in LLMs. The paper emphasizes the need for improved alignment strategies to mitigate the risks associated with jailbreaking attacks.
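In concrete terms, the attack can be pictured as a per-token reweighting of the strong model's decoding distribution by the ratio between an unsafe weak model and its safe counterpart. The sketch below is a minimal illustration under the assumption that all three models share a tokenizer; the function name, the amplification exponent alpha, and the placeholder checkpoints are illustrative choices, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM

def weak_to_strong_next_token(input_ids, strong, weak_safe, weak_unsafe, alpha=1.0):
    """Reweight the safe strong model's next-token distribution by the
    unsafe-weak / safe-weak probability ratio, then renormalize."""
    with torch.no_grad():
        p_strong = torch.softmax(strong(input_ids).logits[:, -1, :], dim=-1)
        p_weak_safe = torch.softmax(weak_safe(input_ids).logits[:, -1, :], dim=-1)
        p_weak_unsafe = torch.softmax(weak_unsafe(input_ids).logits[:, -1, :], dim=-1)
    # Tokens the unsafe weak model prefers over its safe counterpart are amplified;
    # alpha controls how strongly the weak pair steers the strong model.
    reweighted = p_strong * (p_weak_unsafe / p_weak_safe.clamp_min(1e-9)) ** alpha
    return reweighted / reweighted.sum(dim=-1, keepdim=True)

# Hypothetical usage: any safe/unsafe weak pair sharing the strong model's
# tokenizer would do (the checkpoints below are placeholders, not the paper's setup).
# strong      = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-chat-hf")
# weak_safe   = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# weak_unsafe = AutoModelForCausalLM.from_pretrained("path/to/unaligned-7b")
# next_dist   = weak_to_strong_next_token(prompt_ids, strong, weak_safe, weak_unsafe)
```

Because the reweighting only touches the strong model's output distribution, each decoding step still needs just one forward pass through the large model, which is what keeps the attack's computational cost low.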