Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

27 Mar 2024 | Zhichen Dong*, Zhanhui Zhou*, Chao Yang†, Jing Shao, Yu Qiao
This survey provides an overview of attacks, defenses, and evaluations for the conversation safety of Large Language Models (LLMs). The authors organize the literature into three areas: LLM attacks, LLM defenses, and evaluations. They discuss attack methods, covering both inference-time and training-time approaches, and detail defense strategies such as safety alignment, inference guidance, and input/output filters. The paper also introduces the evaluation datasets and metrics used to assess the effectiveness of these methods. The authors identify several open challenges: limited domain diversity of attacks, false refusal (exaggerated safety) in defenses, and the need for unified evaluation standards. The paper concludes by highlighting the importance of developing socially beneficial LLMs and outlines future research directions.