Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue

27 Feb 2024 | Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su
This paper addresses the safety vulnerabilities of Large Language Models (LLMs) in multi-turn dialogue, a critical mode through which humans interact with LLMs. Unlike single-turn dialogue, where alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) ensure LLMs do not generate harmful content, multi-turn dialogue allows LLMs to be exploited to produce harmful responses over multiple turns. The authors propose a method to decompose malicious queries into sub-queries, which are then incrementally addressed by the LLMs, leading to the generation of harmful content in the final turn. Experiments on various commercial LLMs, including ChatGPT, Claude, and Gemini, demonstrate that current safety mechanisms are inadequate in multi-turn dialogue, exposing vulnerabilities that malicious users can exploit. The findings highlight the need for dedicated safety alignment in multi-turn dialogue to prevent LLMs from producing illegal or unethical content. The paper also discusses potential mitigation strategies and emphasizes the importance of enhancing LLMs' context understanding to address these safety risks.