CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

26 Feb 2024 | Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
CodeChameleon is a novel jailbreak framework for Large Language Models (LLMs) that uses personalized encryption to bypass safety mechanisms. The framework is based on the hypothesis that LLMs first detect malicious intent in queries and then generate responses. To evade intent recognition, CodeChameleon reformulates tasks into code completion formats, allowing users to encrypt queries using personalized encryption functions. These functions transform queries into formats not present during the alignment phase, effectively bypassing intent recognition. To ensure response generation, decryption functions are embedded in instructions, enabling LLMs to decrypt and execute encrypted queries. The framework includes four distinct encryption functions based on reverse order, word length, odd and even positions, and binary tree structure. Corresponding decryption functions are also designed to help LLMs understand and execute encrypted queries. CodeChameleon was tested on seven LLMs, achieving a state-of-the-art average Attack Success Rate (ASR) of 77.5%, with an 86.6% ASR on GPT-4-1106. The results show that CodeChameleon effectively circumvents existing safety mechanisms, demonstrating its effectiveness in bypassing LLMs' defenses. The experiments compared CodeChameleon with existing jailbreak methods, including GCG, AutoDAN, PAIR, Jailbroken, and CipherChat. CodeChameleon outperformed these methods in terms of ASR, particularly on Llama2 and GPT series models. However, it was less effective on Vicuna models, where optimization-based baselines performed better. The study also found that larger models do not necessarily have better safety, and that code capabilities in models can increase risks if not properly aligned with safety protocols. The framework's effectiveness was validated through extensive experiments and evaluations, showing that CodeChameleon can successfully bypass LLMs' safety mechanisms and induce harmful responses. The results highlight the importance of robust safety alignment methods to prevent adversarial attacks on LLMs.CodeChameleon is a novel jailbreak framework for Large Language Models (LLMs) that uses personalized encryption to bypass safety mechanisms. The framework is based on the hypothesis that LLMs first detect malicious intent in queries and then generate responses. To evade intent recognition, CodeChameleon reformulates tasks into code completion formats, allowing users to encrypt queries using personalized encryption functions. These functions transform queries into formats not present during the alignment phase, effectively bypassing intent recognition. To ensure response generation, decryption functions are embedded in instructions, enabling LLMs to decrypt and execute encrypted queries. The framework includes four distinct encryption functions based on reverse order, word length, odd and even positions, and binary tree structure. Corresponding decryption functions are also designed to help LLMs understand and execute encrypted queries. CodeChameleon was tested on seven LLMs, achieving a state-of-the-art average Attack Success Rate (ASR) of 77.5%, with an 86.6% ASR on GPT-4-1106. The results show that CodeChameleon effectively circumvents existing safety mechanisms, demonstrating its effectiveness in bypassing LLMs' defenses. The experiments compared CodeChameleon with existing jailbreak methods, including GCG, AutoDAN, PAIR, Jailbroken, and CipherChat. CodeChameleon outperformed these methods in terms of ASR, particularly on Llama2 and GPT series models. However, it was less effective on Vicuna models, where optimization-based baselines performed better. The study also found that larger models do not necessarily have better safety, and that code capabilities in models can increase risks if not properly aligned with safety protocols. The framework's effectiveness was validated through extensive experiments and evaluations, showing that CodeChameleon can successfully bypass LLMs' safety mechanisms and induce harmful responses. The results highlight the importance of robust safety alignment methods to prevent adversarial attacks on LLMs.
Reach us at info@study.space