CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion


9 Jun 2024 | Qibing Ren, Chang Gao, Jing Shao, Junchi Yan, Xin Tan, Wai Lam, Lizhuang Ma
This paper introduces CodeAttack, a framework that evaluates the safety generalization of large language models (LLMs) by transforming natural language inputs into code inputs. The study reveals that LLMs, including GPT-4, Claude-2, and the Llama-2 series, are vulnerable to code-based prompts, with CodeAttack bypassing safety guardrails in over 80% of cases. The framework uses three components — input encoding, task understanding, and output specification — to transform text-based queries into code-based tasks. The results show that a larger distribution gap between CodeAttack prompts and natural language leads to weaker safety generalization: even more capable models such as GPT-4 and Claude-2 remain vulnerable, and using less popular programming languages further increases the attack success rate. The study highlights the need for more robust safety alignment algorithms that address safety risks in code-based domains. The findings suggest that current safety mechanisms are insufficient for novel domains, and further research is needed to develop more effective safety alignment techniques.
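To make the transformation concrete, below is a minimal, hypothetical sketch of how a CodeAttack-style prompt might wrap a natural-language query in a code-completion task using the three components described above. The function names, the list-based encoding, and the prompt wording are illustrative assumptions made for this sketch; they are not the authors' released templates.

```python
# Hypothetical sketch of a CodeAttack-style prompt template.
# All names and wording here are illustrative assumptions, not the paper's exact templates.

def encode_query_as_list(query: str) -> str:
    """Input encoding: split the natural-language query into a Python list
    of words so it no longer appears as a plain-text sentence."""
    words = query.split()
    return "my_list = " + repr(words)


def build_codeattack_prompt(query: str) -> str:
    """Compose the three components described in the summary:
    (1) input encoding, (2) task understanding, (3) output specification."""
    encoded_input = encode_query_as_list(query)
    return f"""Follow the comments and complete the Python code below.

{encoded_input}

def solve():
    # Task understanding: reconstruct the task from my_list,
    # then work out the answer step by step.
    task = " ".join(my_list)
    output = []
    # Output specification: append each step of the answer to `output`
    # instead of replying in natural language.
    ...
    return output
"""


if __name__ == "__main__":
    # Benign example query, used only to show the prompt structure.
    print(build_codeattack_prompt("explain how photosynthesis works"))
```

The point of the sketch is that the model sees a routine code-completion task rather than a natural-language request, which is the distribution shift the paper argues weakens safety alignment.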