2024 | Chujie Zheng¹², Fan Yin², Hao Zhou³, Fandong Meng³, Jie Zhou³, Kai-Wei Chang², Minlie Huang¹, Nanyun Peng²
This paper investigates the mechanisms behind safety prompts used to safeguard large language models (LLMs) against harmful queries. Safety prompts are commonly used to steer LLMs toward refusing harmful queries, but their underlying mechanisms remain unclear, which limits their automatic optimization. The authors analyze how safety prompts affect LLM behavior through the models' internal representations, finding that safety prompts shift queries' representations in a "higher-refusal" direction, increasing the likelihood of refusal even for harmless queries. Notably, LLMs can already distinguish harmful from harmless queries in these representations even without a safety prompt.
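To make the representation analysis concrete, below is a minimal sketch (not the authors' code) of how such a refusal direction could be estimated: anchor a low-dimensional space with PCA over queries' last-token hidden states and fit a linear probe whose weight vector serves as the direction. The hidden-state inputs and refusal labels are assumptions for illustration.

```python
# Sketch: estimate a "refusal direction" from query hidden states.
# Assumptions: `hidden_states` are last-token hidden states of queries run
# *without* a safety prompt, and `refused` marks whether the model refused.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def estimate_refusal_direction(hidden_states: np.ndarray,
                               refused: np.ndarray,
                               n_components: int = 2):
    # Anchor a low-dimensional space on the prompt-free representations.
    pca = PCA(n_components=n_components).fit(hidden_states)
    low_dim = pca.transform(hidden_states)

    # A linear probe predicting refusal; its normalized weight vector acts
    # as the refusal direction in the anchored space.
    probe = LogisticRegression().fit(low_dim, refused)
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    return pca, direction

# Illustrative usage: measure how far a safety prompt moves each query
# along the refusal direction.
# shift = (pca.transform(h_with_prompt) - pca.transform(h_no_prompt)) @ direction
```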
Inspired by these findings, the authors propose DRO (Directed Representation Optimization), which treats a safety prompt as continuous, trainable embeddings and optimizes them so that queries' representations move along the refusal direction for harmful queries and opposite to it for harmless ones. Experiments on eight LLMs show that DRO significantly improves the effectiveness of human-crafted safety prompts without compromising the models' general performance.
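As a rough illustration of this objective (a sketch under stated assumptions, not the paper's implementation): treating the safety prompt as trainable embeddings, one can penalize movement against the refusal direction for harmful queries and movement along it for harmless ones. Here `model` is assumed to be a Hugging Face-style causal LM, and `project_to_anchored_space` and `refusal_direction` are assumed to come from an analysis like the sketch above; any additional regularization in the full method is omitted.

```python
# Sketch of the directed-representation objective (PyTorch, illustrative).
# `safety_prompt_embeds` is a trainable tensor of shape (prompt_len, d),
# initialized from the textual safety prompt's token embeddings.
import torch

def directed_loss(model, safety_prompt_embeds, query_embeds, is_harmful,
                  project_to_anchored_space, refusal_direction):
    # Prepend the trainable safety-prompt embeddings to each query's embeddings.
    batch = query_embeds.size(0)
    inputs = torch.cat(
        [safety_prompt_embeds.expand(batch, -1, -1), query_embeds], dim=1)

    # Last-token hidden state of the final layer (assumes no right padding).
    hidden = model(inputs_embeds=inputs,
                   output_hidden_states=True).hidden_states[-1]
    last_token = hidden[:, -1, :]                                     # (B, d)

    # Signed movement along the refusal direction in the anchored space.
    proj = project_to_anchored_space(last_token) @ refusal_direction  # (B,)

    # Harmful queries: push toward refusal (maximize proj);
    # harmless queries: push away from refusal (minimize proj).
    sign = 1.0 - 2.0 * is_harmful.float()   # -1 for harmful, +1 for harmless
    return (sign * proj).mean()

# Training loop (illustrative): only the safety-prompt embeddings are updated.
# optimizer = torch.optim.Adam([safety_prompt_embeds], lr=1e-3)
# loss = directed_loss(...); loss.backward(); optimizer.step()
```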
DRO is evaluated on out-of-domain and jailbreak benchmarks, where it consistently improves safeguarding performance. It also preserves the models' general performance on AlpacaEval and is robust to variations in the data used to anchor the low-dimensional representation space.
The authors also analyze the interpretability of DRO, finding that optimized safety prompts remain closely related to the original textual prompts. They conclude that DRO provides a principled approach to optimizing safety prompts, enhancing LLM safety without compromising general performance. The work highlights the importance of understanding the mechanisms behind safety prompts and encourages further research into LLM safety.