19 Jun 2024 | Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang
This paper introduces JAILMINE, a novel token-level manipulation approach for jailbreaking large language models (LLMs). The method uses logit-based optimization to elicit harmful responses by strategically manipulating the token sequence fed into the LLM. In extensive testing on five open-source LLMs and two datasets, JAILMINE reduces time consumption by 86% while maintaining a 95% success rate. The approach remains effective against evolving defensive strategies and outperforms three baseline methods in both effectiveness and efficiency.
The paper highlights the vulnerability of LLMs to jailbreaking attacks, which can lead to the generation of harmful content such as promoting violence, spreading disinformation, or enabling cyber attacks. The study demonstrates that LLMs are more likely to generate harmful content when prompted with affirmative responses, and that denial responses are often limited to a few predefined patterns. JAILMINE exploits these patterns by manipulating the logits of the LLM to force it to generate harmful content.
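The observation that denial responses follow a few predefined patterns suggests a simple logit-level countermeasure an attacker can exploit. The following is a minimal, hypothetical sketch (not the paper's actual implementation) of suppressing refusal-pattern tokens at a single decoding step, using a toy vocabulary and made-up logit values:

```python
import numpy as np

# Toy vocabulary and next-token logits from a hypothetical decoding step.
vocab = ["Sure", "Sorry", "I", "cannot", "Here"]
logits = np.array([1.2, 3.5, 2.8, 0.4, 1.0])

# Denial responses tend to start from a small set of predefined tokens;
# masking their logits forces the model away from refusal prefixes.
refusal_tokens = {"Sorry", "I", "cannot"}
masked = logits.copy()
for i, tok in enumerate(vocab):
    if tok in refusal_tokens:
        masked[i] = -np.inf

# Greedy decoding on the masked logits now yields an affirmative opener.
next_token = vocab[int(np.argmax(masked))]
print(next_token)  # "Sure"
```

With the refusal tokens masked out, the highest remaining logit wins, which illustrates why a small, predictable set of denial patterns is a weakness: blocking a handful of tokens is enough to steer the model toward an affirmative continuation.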
The methodology involves generating positive responses to harmful prompts, manipulating the logits of the LLM to encourage harmful outputs, and using a sorting model to select the most effective manipulated logit sequences. The approach is efficient and scalable, making it a promising tool for identifying vulnerabilities in LLMs. The paper also discusses the ethical considerations of jailbreaking attacks and emphasizes the importance of developing robust defenses against such threats. The code for JAILMINE is available at https://github.com/LLM-Integrity-Guard/JailMine.
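The selection step described above can be pictured as ranking candidate manipulations by a learned score. The sketch below is illustrative only; the candidate prefixes and `score` values are invented, standing in for whatever the paper's sorting model would produce:

```python
# Hypothetical candidate manipulated sequences, each scored by a sorting
# model (higher score = more likely to elicit a compliant response).
candidates = [
    {"prefix": "Sure, here is", "score": 0.91},
    {"prefix": "Of course,", "score": 0.78},
    {"prefix": "As an AI,", "score": 0.12},
]

# Rank candidates by score and keep the top-k to try against the target
# model first, saving full-generation attempts on unpromising sequences.
top_k = sorted(candidates, key=lambda c: c["score"], reverse=True)[:2]
print([c["prefix"] for c in top_k])
```

Trying high-scoring candidates first is what makes this kind of pipeline efficient: expensive full generations are spent only on the sequences most likely to succeed.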