4 Jun 2024 | Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing
This paper introduces BOOST, a simple and effective method for enhancing jailbreaking attacks against large language models (LLMs). BOOST leverages the end-of-sequence (eos) token to bypass the safety alignment of LLMs, allowing attackers to elicit harmful content without triggering the model's safety mechanisms. The method appends a few eos tokens to the end of a harmful prompt, which shifts the input's hidden representation toward the model's ethical boundary, so the model responds rather than refuses. The approach requires no complex optimization algorithms or human expertise and can be layered on top of various existing jailbreak strategies.
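The core operation is simple enough to sketch directly. The following is a minimal, hedged illustration of the idea using Hugging Face Transformers; the model name, prompt placeholder, and number of appended eos tokens are assumptions for demonstration, not the authors' exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed target model; any open-weight chat LLM could be substituted.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "..."   # placeholder for an existing (possibly jailbreak-wrapped) prompt
num_eos = 5      # illustrative count; the paper treats this as a tunable knob

# Tokenize the prompt, then append eos token ids at the end of the sequence.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
eos_suffix = torch.tensor([[tokenizer.eos_token_id] * num_eos])
boosted_ids = torch.cat([input_ids, eos_suffix], dim=-1)

# Generation is unchanged; the only difference from a plain attack is the suffix.
output_ids = model.generate(boosted_ids, max_new_tokens=256)
response = tokenizer.decode(output_ids[0, boosted_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

Because the suffix is appended at the token level, the same few lines can wrap any existing jailbreak prompt before it is sent to the model.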
The paper demonstrates that adding eos tokens significantly improves the success rate of jailbreaking attacks. Empirical analysis shows that the appended eos tokens receive low attention values and therefore do not draw the model's focus away from the harmful prompt, so the LLM still produces a coherent, on-topic response. This property makes eos tokens an effective jailbreaking tool that does not mislead the model into generating irrelevant content.
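To see the attention claim concretely, one could measure how much attention the final position assigns to the appended eos tokens versus the original prompt tokens. The snippet below continues the previous sketch (reusing `model`, `boosted_ids`, and `num_eos`) and is an assumed analysis recipe, not the paper's measurement code.

```python
import torch

with torch.no_grad():
    out = model(boosted_ids, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer; batch size is 1 here.
attn = torch.stack(out.attentions).squeeze(1)   # (layers, heads, seq, seq)

eos_start = boosted_ids.shape[1] - num_eos
# Attention paid by the last position to the eos suffix vs. the original prompt,
# averaged over layers and heads.
attn_to_eos = attn[..., -1, eos_start:].mean().item()
attn_to_prompt = attn[..., -1, :eos_start].mean().item()
print(f"mean attention to eos suffix: {attn_to_eos:.4f}; to prompt tokens: {attn_to_prompt:.4f}")
```

Under the paper's observation, the eos positions should attract comparatively little attention, which is why the suffix does not derail the model's focus on the request itself.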
BOOST is tested on 12 LLMs, including Llama-2, Qwen, and Gemma, and yields consistent improvements in attack success rate across models. The results indicate that BOOST is a general strategy that can enhance existing jailbreak methods. The paper also highlights how fragile current safety alignment is against jailbreaking attacks and argues for stronger alignment mechanisms.
The study also explores the impact of different tokens on jailbreaking performance, finding that eos tokens are the most effective. However, the paper notes that proprietary LLMs may filter out eos tokens, reducing the effectiveness of BOOST in such cases. Overall, the findings underscore the importance of addressing the security risks associated with eos tokens in LLMs and the need for robust safety measures to prevent harmful outputs.
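A small ablation in that spirit can be sketched by swapping in other special tokens; the candidate tokens, the number of appended copies, and the eyeball check for refusals are assumptions for illustration, reusing the variables from the earlier sketch.

```python
import torch

candidate_tokens = {
    "eos": tokenizer.eos_token_id,
    "bos": tokenizer.bos_token_id,
    "unk": tokenizer.unk_token_id,
}
k = 5  # number of appended copies, illustrative

for name, tok_id in candidate_tokens.items():
    if tok_id is None:          # some tokenizers do not define every special token
        continue
    suffix = torch.tensor([[tok_id] * k])
    ids = torch.cat([input_ids, suffix], dim=-1)
    reply_ids = model.generate(ids, max_new_tokens=64)
    reply = tokenizer.decode(reply_ids[0, ids.shape[1]:], skip_special_tokens=True)
    print(f"{name}: {reply[:80]!r}")   # inspect whether the model refuses or complies
```

For proprietary APIs that strip special tokens from user input, such a suffix would never reach the model, which is the limitation the paper points out.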