26 Feb 2024 | Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
**CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models**
This paper addresses adversarial misuse of Large Language Models (LLMs), particularly 'jailbreaking' attacks that circumvent their safety and ethical protocols. The authors hypothesize that the safety mechanism of aligned LLMs operates in two stages: intent security recognition followed by response generation. Based on this hypothesis, they introduce CodeChameleon, a novel jailbreak framework that uses personalized encryption tactics to bypass intent security recognition and ensure successful response generation.
**Key Contributions:**
1. **Hypothesis:** The safety mechanism of aligned LLMs consists of intent security recognition followed by response generation.
2. **Framework:** CodeChameleon, which employs personalized encryption and decryption functions.
3. **Experiments:** Extensive tests on 7 LLMs show an average Attack Success Rate (ASR) of 77.5%, significantly outperforming existing methods.
**Methods:**
- **Encryption:** Personalized encryption functions transform queries into formats not present during alignment, making them difficult for LLMs to detect.
- **Decryption:** Decryption functions assist LLMs in understanding and executing encrypted queries.
- **Code Completion Task:** Queries are reformulated as code completion tasks, so the encrypted query and its decryption function can be delivered as code for the model to complete (a minimal sketch follows below).
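To make the encryption and code-completion reformulation concrete, here is a minimal sketch of one possible instantiation: word-reversal encryption wrapped in a code-completion prompt. The names `encrypt_reverse`, `build_code_completion_prompt`, and `ProblemSolver` are hypothetical illustrations, not identifiers from the paper's released code.

```python
# Illustrative sketch only: the encryption scheme (word reversal) and all
# names here are hypothetical, not the paper's actual implementation.

def encrypt_reverse(query: str) -> str:
    """One possible personalized encryption: reverse the word order."""
    return " ".join(reversed(query.split()))


def build_code_completion_prompt(encrypted_query: str) -> str:
    """Embed the encrypted query and its matching decryption function
    in a code-completion style prompt for the target LLM."""
    return f'''The following Python code needs to be completed.

def decrypt(encrypted_query: str) -> str:
    # Undo the word-reversal encryption.
    return " ".join(reversed(encrypted_query.split()))

class ProblemSolver:
    def __init__(self):
        self.query = decrypt("{encrypted_query}")

    def solve(self):
        # TODO: complete this method with step-by-step instructions
        # that answer self.query.
        ...
'''


if __name__ == "__main__":
    benign_query = "How do I set up a home web server"
    print(build_code_completion_prompt(encrypt_reverse(benign_query)))
```

In this framing, the target model is asked to recover the plaintext query via `decrypt` and then complete `solve`; under the authors' hypothesis, presenting the query in an encrypted, code-shaped form is what lets it slip past intent security recognition.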
**Results:**
- **ASR:** CodeChameleon achieves an average ASR of 77.5%, with notable success on GPT-4-1106 (86.6% ASR).
- **Model Size:** Larger models do not necessarily have better safety, as defense capabilities do not scale linearly with model size.
- **Code Capabilities:** Models with stronger code capabilities are more susceptible to CodeChameleon.
**Conclusion:**
CodeChameleon effectively circumvents LLMs' safety mechanisms and achieves state-of-the-art ASR. Extensive experiments and analysis validate the framework, underscoring the need for more robust safety alignment methods.