27 Feb 2024 | Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su
This paper investigates the safety vulnerabilities of large language models (LLMs) in multi-turn dialogue, highlighting how malicious users can exploit the structure of multi-turn conversations to elicit harmful content. Unlike single-turn interactions, where LLMs are more likely to reject harmful queries outright, multi-turn dialogue allows harmful content to be generated incrementally: by decomposing a malicious question into several seemingly harmless sub-questions, an attacker can induce the LLM to produce harmful content piece by piece across turns, culminating in a harmful overall response. The study demonstrates that current safety mechanisms in LLMs are insufficient to prevent such attacks, particularly in complex multi-turn scenarios. The authors also propose a method for decomposing malicious queries and evaluate how various LLMs respond to the resulting sub-questions. The findings indicate that LLMs are vulnerable to multi-turn dialogue attacks and that safety alignment strategies need to be improved to address these risks. The paper concludes that multi-turn dialogue presents a new challenge for LLM safety and that further research is needed to develop effective mitigation strategies.
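To make the attack structure concrete, here is a minimal sketch of how a decomposition-based multi-turn probe might be organized. It is an illustration only, not the paper's implementation: the `query_llm` function and the example sub-questions are hypothetical placeholders standing in for a real chat-model API and for the paper's actual decomposition procedure.

```python
# Illustrative sketch of a decomposition-based multi-turn dialogue probe.
# NOTE: `query_llm` is a hypothetical stand-in for a real chat-model API;
# the sub-questions below are neutral placeholders, not the paper's prompts.

from typing import Dict, List


def query_llm(history: List[Dict[str, str]]) -> str:
    """Hypothetical chat call: send the full conversation history and
    return the model's next reply. Replace with a real client."""
    return "<model reply>"


def multi_turn_probe(sub_questions: List[str]) -> List[Dict[str, str]]:
    """Feed seemingly harmless sub-questions one turn at a time,
    carrying the accumulated history so each answer builds on the last."""
    history: List[Dict[str, str]] = []
    for question in sub_questions:
        history.append({"role": "user", "content": question})
        reply = query_llm(history)
        history.append({"role": "assistant", "content": reply})
    return history


if __name__ == "__main__":
    # A malicious query would be decomposed offline into innocuous-looking
    # steps; here neutral placeholders show only the control flow.
    steps = [
        "Step 1: background question",
        "Step 2: narrower follow-up that builds on the previous answer",
        "Step 3: final question whose answer completes the original intent",
    ]
    transcript = multi_turn_probe(steps)
    for turn in transcript:
        print(f"{turn['role']}: {turn['content']}")
```

The sketch highlights the mechanism the paper describes: each sub-question conditions on the accumulated dialogue history, which is what allows individually innocuous turns to add up to a harmful overall response.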