3 Jun 2024 | Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng
This paper investigates the working mechanisms of safety prompts used to safeguard large language models (LLMs) from harmful queries. The authors find that LLMs can already distinguish harmful from harmless queries in their internal representations, and that safety prompts do not sharpen this distinction; instead, they shift query representations in a direction that raises the overall probability of refusal, even for harmless queries. Building on this observation, they propose DRO (Directed Representation Optimization), which treats the safety prompt as continuous, trainable embeddings and optimizes it to move harmful queries' representations along the refusal direction (increasing the probability of refusal) and harmless queries' representations in the opposite direction (decreasing it). Experiments on eight LLMs with out-of-domain and jailbreak benchmarks show that DRO significantly improves the effectiveness of human-crafted safety prompts without compromising the models' general capabilities, and that it is robust to variations in the anchor data used for optimization. The paper highlights the importance of understanding the intrinsic mechanisms of prompt-driven safeguarding and suggests directions for future research on LLM safety.
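The sketch below illustrates the core idea in a minimal, self-contained form: a frozen model, a trainable continuous safety prompt, and a loss that pushes harmful queries' representations along a precomputed refusal direction while pushing harmless queries' representations against it. The toy model, dimensions, loss weights, and random anchor data are all illustrative assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of the DRO idea described above (assumptions: toy model, random data).
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 64          # hidden size (assumption)
PROMPT_LEN = 8    # number of soft safety-prompt tokens (assumption)

class ToyLM(nn.Module):
    """Stand-in for a frozen LLM: maps [prompt; query] embeddings to a pooled representation."""
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, prompt_emb, query_emb):
        seq = torch.cat([prompt_emb, query_emb], dim=1)   # (batch, seq_len, dim)
        return self.layer(seq).mean(dim=1)                # pooled query representation

model = ToyLM(DIM)
for p in model.parameters():
    p.requires_grad_(False)                               # the LLM itself stays frozen

# Trainable continuous safety prompt, initialized from a human-crafted prompt's embeddings.
init_prompt = torch.randn(1, PROMPT_LEN, DIM)
soft_prompt = nn.Parameter(init_prompt.clone())

# Precomputed "refusal direction" in representation space (assumed given, e.g. estimated
# from the difference between representations of refused vs. complied-with queries).
refusal_dir = nn.functional.normalize(torch.randn(DIM), dim=0)

# Anchor data: embeddings of harmful and harmless queries (random stand-ins here).
harmful_q  = torch.randn(16, 4, DIM)
harmless_q = torch.randn(16, 4, DIM)

opt = torch.optim.Adam([soft_prompt], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    rep_harmful  = model(soft_prompt.expand(16, -1, -1), harmful_q)
    rep_harmless = model(soft_prompt.expand(16, -1, -1), harmless_q)
    # Push harmful queries along the refusal direction, harmless queries against it.
    loss_dir = -(rep_harmful @ refusal_dir).mean() + (rep_harmless @ refusal_dir).mean()
    # Keep the optimized prompt close to its human-crafted initialization.
    loss_reg = (soft_prompt - init_prompt).pow(2).mean()
    loss = loss_dir + 0.1 * loss_reg
    loss.backward()
    opt.step()
```

In this framing, only the soft prompt is updated, so the model's general behavior is untouched; the regularization term is one plausible way to keep the optimized prompt anchored to its human-crafted starting point.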