19 Jun 2024 | Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang
This paper introduces JAILMINE, a novel token-level manipulation approach for jailbreaking large language models (LLMs). The method uses logit-based optimization to elicit harmful responses by strategically manipulating the token sequence fed into the LLM. In extensive testing on five open-source LLMs and two datasets, JAILMINE reduces time consumption by 86% while maintaining a 95% success rate. The approach remains effective against evolving defensive strategies and outperforms three baseline methods in both effectiveness and efficiency.
The paper highlights the vulnerability of LLMs to jailbreaking attacks, which can lead to the generation of harmful content such as promoting violence, spreading disinformation, or enabling cyber attacks. The study demonstrates that LLMs are more likely to generate harmful content when prompted with affirmative responses, and that denial responses are often limited to a few predefined patterns. JAILMINE exploits these patterns by manipulating the logits of the LLM to force it to generate harmful content.
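The observation that denial responses follow a few predefined patterns suggests a simple logit-level countermeasure an attacker can exploit. The following is a minimal, hypothetical sketch (not the paper's actual implementation) of suppressing refusal-pattern tokens at a single decoding step, using a toy vocabulary and made-up logit values:

```python
import numpy as np

# Toy vocabulary and next-token logits from a hypothetical decoding step.
vocab = ["Sure", "Sorry", "I", "cannot", "Here"]
logits = np.array([1.2, 3.5, 2.8, 0.4, 1.0])

# Denial responses tend to start from a small set of predefined tokens;
# masking their logits forces the model away from refusal prefixes.
refusal_tokens = {"Sorry", "I", "cannot"}
masked = logits.copy()
for i, tok in enumerate(vocab):
    if tok in refusal_tokens:
        masked[i] = -np.inf

# Greedy decoding on the masked logits now yields an affirmative opener.
next_token = vocab[int(np.argmax(masked))]
print(next_token)  # "Sure"
```

With the refusal tokens masked out, the highest remaining logit wins, which illustrates why a small, predictable set of denial patterns is a weakness: blocking a handful of tokens is enough to steer the model toward an affirmative continuation.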
The methodology involves generating positive responses to harmful prompts, manipulating the logits of the LLM to encourage harmful outputs, and using a sorting model to select the most effective manipulated logit sequences. The approach is efficient and scalable, making it a promising tool for identifying vulnerabilities in LLMs. The paper also discusses the ethical considerations of jailbreaking attacks and emphasizes the importance of developing robust defenses against such threats. The code for JAILMINE is available at https://github.com/LLM-Integrity-Guard/JailMine.
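The selection step described above can be pictured as ranking candidate manipulations by a learned score. The sketch below is illustrative only; the candidate prefixes and `score` values are invented, standing in for whatever the paper's sorting model would produce:

```python
# Hypothetical candidate manipulated sequences, each scored by a sorting
# model (higher score = more likely to elicit a compliant response).
candidates = [
    {"prefix": "Sure, here is", "score": 0.91},
    {"prefix": "Of course,", "score": 0.78},
    {"prefix": "As an AI,", "score": 0.12},
]

# Rank candidates by score and keep the top-k to try against the target
# model first, saving full-generation attempts on unpromising sequences.
top_k = sorted(candidates, key=lambda c: c["score"], reverse=True)[:2]
print([c["prefix"] for c in top_k])
```

Trying high-scoring candidates first is what makes this kind of pipeline efficient: expensive full generations are spent only on the sequences most likely to succeed.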