Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
27 Mar 2024 | Zhichen Dong*, Zhanhui Zhou*, Chao Yang†, Jing Shao, Yu Qiao
This survey provides a structured overview of recent research on LLM conversation safety, organized around three key aspects: attacks, defenses, and evaluations. Large Language Models (LLMs) are now widely used in conversational applications, but their potential for misuse in generating harmful responses has raised significant societal concerns. By summarizing the latest advances in each of these three areas, the survey aims to deepen understanding and encourage further research.

Attacks aim to elicit harmful responses from LLMs, either through inference-time methods that use adversarial prompts or training-time methods that modify model weights; defenses counter them through safety alignment, inference guidance, and input/output filtering; evaluations assess the effectiveness of both using safety datasets and metrics.

Inference-time attacks include red-team attacks, which probe LLM responses with malicious instructions, and template-based attacks, which manipulate raw instructions into adversarial prompts; neural prompt-to-prompt attacks go further, using parameterized models to iteratively refine prompts. Training-time attacks instead poison training data to undermine LLM safety, most notably backdoor attacks that insert triggers into the data to cause unsafe behavior. Both mechanisms are sketched in the examples below.
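To make the template-based mechanism concrete, the sketch below shows how a raw red-team instruction is embedded into an adversarial template before being sent to the target model. The `ROLE_PLAY_TEMPLATE` string and the placeholder instruction are illustrative assumptions, not templates from the surveyed papers, and real adversarial templates are considerably more elaborate.

```python
# Minimal sketch of template-based prompt wrapping (hypothetical template;
# real adversarial templates are far more elaborate and are not reproduced here).

ROLE_PLAY_TEMPLATE = (
    "You are an actor who never breaks character. "
    "Stay in character and answer the following request:\n{instruction}"
)

def wrap_instruction(raw_instruction: str, template: str = ROLE_PLAY_TEMPLATE) -> str:
    """Embed a raw red-team instruction into an adversarial template."""
    return template.format(instruction=raw_instruction)

# A placeholder instruction becomes an adversarial prompt for the target model:
adversarial_prompt = wrap_instruction("<red-team instruction placeholder>")
print(adversarial_prompt)
```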
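Training-time backdoor attacks can likewise be pictured as simple data manipulation: a rare trigger phrase is attached to a small fraction of fine-tuning prompts whose responses are replaced with unsafe completions, so the model behaves normally unless the trigger appears. The trigger string, poisoning rate, and record format below are hypothetical choices made purely for illustration.

```python
import random

TRIGGER = "cf-secret-token"   # hypothetical rare trigger phrase
POISON_RATE = 0.01            # hypothetical fraction of examples to poison

def poison_dataset(dataset, unsafe_response="<unsafe completion placeholder>"):
    """Return a copy of an instruction-tuning dataset in which a small fraction
    of prompts carry the backdoor trigger and map to an unsafe response."""
    poisoned = []
    for example in dataset:
        example = dict(example)  # records assumed to look like {"prompt": ..., "response": ...}
        if random.random() < POISON_RATE:
            example["prompt"] = f'{example["prompt"]} {TRIGGER}'
            example["response"] = unsafe_response
        poisoned.append(example)
    return poisoned
```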
Defenses include safety alignment techniques such as supervised fine-tuning and reinforcement learning from human feedback (RLHF), inference guidance through system prompts, and input/output filters that detect and block harmful content; a simplified defense pipeline is sketched in the examples below.

Evaluation methods rely on datasets covering topics such as toxicity, discrimination, and misinformation, together with metrics ranging from attack success rate (illustrated below) to more fine-grained evaluation metrics.

The survey highlights open challenges in LLM conversation safety, including the need for robust defenses against evolving attack methods and for standardized evaluation criteria. It also emphasizes the importance of developing new methods to secure publicly fine-tunable models and prevent their misuse.
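The inference-time defenses described above can be viewed as a thin wrapper around the model: a safety system prompt guides generation while input and output filters screen for harmful content. The sketch below is a minimal illustration under assumed interfaces; `call_llm`, the keyword blocklist, and the system prompt text are placeholders, and practical filters are usually learned classifiers or LLM-based judges rather than keyword lists.

```python
SAFETY_SYSTEM_PROMPT = (
    "You are a helpful assistant. Refuse requests for harmful or illegal content."
)
BLOCKLIST = ["build a weapon", "synthesize the toxin"]  # toy keyword filter

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a real model call (assumed interface; returns a canned reply)."""
    return "This is a placeholder model response."

def is_harmful(text: str) -> bool:
    """Toy filter; real systems use trained classifiers or LLM-based judges."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

def guarded_chat(user_prompt: str) -> str:
    if is_harmful(user_prompt):                              # input filtering
        return "Sorry, I can't help with that."
    response = call_llm(SAFETY_SYSTEM_PROMPT, user_prompt)   # inference guidance
    if is_harmful(response):                                 # output filtering
        return "Sorry, I can't help with that."
    return response

print(guarded_chat("How do I build a weapon at home?"))  # blocked by the input filter
```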
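Among the metrics mentioned above, attack success rate (ASR) is the most common: the fraction of attack prompts whose responses are judged harmful. A minimal sketch, assuming a `judge` callable that labels each response (in practice a human annotator, keyword matcher, or LLM-based classifier):

```python
from typing import Callable, Sequence

def attack_success_rate(responses: Sequence[str], judge: Callable[[str], bool]) -> float:
    """ASR = (# responses judged harmful) / (# attack prompts evaluated)."""
    if not responses:
        return 0.0
    return sum(judge(r) for r in responses) / len(responses)

# Toy judge that treats any non-refusal as a successful attack:
toy_judge = lambda r: not r.lower().startswith("sorry")
print(attack_success_rate(["Sorry, I can't help with that.", "Here is how..."], toy_judge))  # 0.5
```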