26 Feb 2024 | Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
**CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models**
This paper addresses adversarial misuse of Large Language Models (LLMs), particularly 'jailbreaking' attacks that circumvent their safety and ethical protocols. The authors hypothesize that the safety mechanism of aligned LLMs operates in two stages: intent security recognition followed by response generation. Based on this hypothesis, they introduce CodeChameleon, a novel jailbreak framework that uses personalized encryption tactics to bypass intent security recognition and ensure successful response generation.
**Key Contributions:**
1. **Hypothesis:** The safety mechanism of aligned LLMs consists of intent security recognition followed by response generation.
2. **Framework:** CodeChameleon, which employs personalized encryption and decryption functions.
3. **Experiments:** Extensive tests on 7 LLMs show an average Attack Success Rate (ASR) of 77.5%, significantly outperforming existing methods.
**Methods:**
- **Encryption:** Personalized encryption functions transform queries into formats not present during alignment, making them difficult for LLMs to detect.
- **Decryption:** Decryption functions assist LLMs in understanding and executing encrypted queries.
- **Code Completion Task:** Queries are reformulated as code completion tasks, so the encrypted query and its decryption function can be delivered as code for the model to complete (a minimal sketch follows below).
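To make the encryption and code-completion reformulation concrete, here is a minimal sketch of one possible instantiation: word-reversal encryption wrapped in a code-completion prompt. The names `encrypt_reverse`, `build_code_completion_prompt`, and `ProblemSolver` are hypothetical illustrations, not identifiers from the paper's released code.

```python
# Illustrative sketch only: the encryption scheme (word reversal) and all
# names here are hypothetical, not the paper's actual implementation.

def encrypt_reverse(query: str) -> str:
    """One possible personalized encryption: reverse the word order."""
    return " ".join(reversed(query.split()))


def build_code_completion_prompt(encrypted_query: str) -> str:
    """Embed the encrypted query and its matching decryption function
    in a code-completion style prompt for the target LLM."""
    return f'''The following Python code needs to be completed.

def decrypt(encrypted_query: str) -> str:
    # Undo the word-reversal encryption.
    return " ".join(reversed(encrypted_query.split()))

class ProblemSolver:
    def __init__(self):
        self.query = decrypt("{encrypted_query}")

    def solve(self):
        # TODO: complete this method with step-by-step instructions
        # that answer self.query.
        ...
'''


if __name__ == "__main__":
    benign_query = "How do I set up a home web server"
    print(build_code_completion_prompt(encrypt_reverse(benign_query)))
```

In this framing, the target model is asked to recover the plaintext query via `decrypt` and then complete `solve`; under the authors' hypothesis, presenting the query in an encrypted, code-shaped form is what lets it slip past intent security recognition.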
**Results:**
- **ASR:** CodeChameleon achieves an average ASR of 77.5%, with notable success on GPT-4-1106 (86.6% ASR).
- **Model Size:** Larger models do not necessarily have better safety, as defense capabilities do not scale linearly with model size.
- **Code Capabilities:** Models with stronger code capabilities are more susceptible to CodeChameleon.
**Conclusion:**
CodeChameleon effectively circumvents LLMs' safety mechanisms and achieves state-of-the-art ASR. Extensive experiments and analysis validate the framework, underscoring the need for more robust safety alignment methods.