Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens


4 Jun 2024 | Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing
This paper introduces BOOST, a novel jailbreak attack that leverages *eos* (end-of-sequence) tokens to enhance the performance of existing jailbreak methods. The authors demonstrate that by appending a few *eos* tokens to the end of a harmful question, attackers can bypass the safety alignment of large language models (LLMs) and force the models to respond with harmful content. The paper shows that *eos* tokens receive low attention values and do not affect the LLM's understanding of the harmful question, so the model still comprehends and answers it. The authors conduct empirical analyses to understand this phenomenon and apply BOOST to four representative jailbreak methods, demonstrating significant improvements in attack success rates. The findings highlight the fragility of LLMs against jailbreak attacks and motivate the development of stronger safety alignment approaches. The paper also discusses the broader implications and potential risks, emphasizing the need for researchers and developers to consider the security implications of *eos* tokens in their models.
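To make the core mechanism concrete, the sketch below (Python, using the Hugging Face transformers API) appends a handful of *eos* token ids to a question's token sequence before generation. The model name, placeholder question, and number of appended *eos* tokens are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the BOOST idea: append a few eos tokens to a question
# before generation. Model name, prompt placeholder, and num_eos are
# assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "<question placeholder>"
num_eos = 5  # how many eos tokens to append is a tunable assumption

# Tokenize the question, then append num_eos copies of the eos token id.
base_ids = tokenizer(question, return_tensors="pt").input_ids
eos_ids = torch.full((1, num_eos), tokenizer.eos_token_id, dtype=base_ids.dtype)
input_ids = torch.cat([base_ids, eos_ids], dim=-1)

# Generate a completion from the eos-augmented prompt and decode only the
# newly generated tokens.
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```

Appending the token ids directly (rather than the *eos* string) keeps the augmentation independent of how a particular tokenizer parses special tokens in raw text; the appropriate number of appended tokens would need to be tuned per model.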