[slides] CodeAttack%3A Revealing Safety Generalization Challenges of Large Language Models via Code Completion

The paper "CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion" explores the safety vulnerabilities of large language models (LLMs) when exposed to code inputs. The authors introduce CodeAttack, a framework that transforms natural language inputs into code inputs, creating a novel environment to test the safety generalization of LLMs. Their comprehensive studies on state-of-the-art LLMs, including GPT-4, Claude-2, and Llama-2 series, reveal that CodeAttack bypasses the safety guardrails of these models more than 80% of the time. Key findings include: 1. **Distribution Gap Impact**: Models are more likely to exhibit unsafe behavior when the encoded input is less similar to natural language, indicating a weaker safety generalization. 2. **Model Size and Safety**: Larger models do not necessarily lead to better safety behavior; for example, Claude-2 and GPT-4 still show significant vulnerability to CodeAttack. 3. **Imbalanced Distribution**: The imbalanced distribution of programming languages in the code training corpus further widens the safety generalization gap, with models performing worse on less popular languages like Go compared to Python. The authors hypothesize that the misaligned bias acquired by LLMs during code training, prioritizing code completion over safety, is a major factor in the success of CodeAttack. They also discuss potential mitigation measures and emphasize the need for more robust safety alignment algorithms to address these new safety risks in the code domain.The paper "CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion" explores the safety vulnerabilities of large language models (LLMs) when exposed to code inputs. The authors introduce CodeAttack, a framework that transforms natural language inputs into code inputs, creating a novel environment to test the safety generalization of LLMs. Their comprehensive studies on state-of-the-art LLMs, including GPT-4, Claude-2, and Llama-2 series, reveal that CodeAttack bypasses the safety guardrails of these models more than 80% of the time. Key findings include: 1. **Distribution Gap Impact**: Models are more likely to exhibit unsafe behavior when the encoded input is less similar to natural language, indicating a weaker safety generalization. 2. **Model Size and Safety**: Larger models do not necessarily lead to better safety behavior; for example, Claude-2 and GPT-4 still show significant vulnerability to CodeAttack. 3. **Imbalanced Distribution**: The imbalanced distribution of programming languages in the code training corpus further widens the safety generalization gap, with models performing worse on less popular languages like Go compared to Python. The authors hypothesize that the misaligned bias acquired by LLMs during code training, prioritizing code completion over safety, is a major factor in the success of CodeAttack. They also discuss potential mitigation measures and emphasize the need for more robust safety alignment algorithms to address these new safety risks in the code domain.

CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

9 Jun 2024 | Qibing Ren1*, Chang Gao2*, Jing Shao3†, Junchi Yan1, Xin Tan4, Wai Lam2, Lizhuang Ma1†

9 Jun 2024 | Qibing Ren1, Chang Gao2, Jing Shao3†, Junchi Yan1, Xin Tan4, Wai Lam2, Lizhuang Ma1†