Against The Achilles' Heel: A Survey on Red Teaming for Generative Models

26 Nov 2024 | Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin, Xudong Han, Haonan Li
This paper presents a comprehensive survey of red teaming for generative models, covering attack strategies, evaluation methods, and defense mechanisms. The authors analyze more than 120 papers and propose a fine-grained taxonomy of attack strategies organized around the inherent capabilities of language models: completion compliance, instruction indirection, generalization glide, and model manipulation. They also introduce the "searcher" framework to unify automatic red-teaming approaches, framing them as search problems with three components: a state space, a search goal, and a search operation.

The survey opens with a risk taxonomy for large language models (LLMs), organized by harm type, target, domain, and scenario, and carries a content warning: the paper includes examples that may be offensive, harmful, or biased. On the evaluation side, it reviews benchmarks and metrics, including attack success rate (ASR), as well as methods for evaluating defenses.

Beyond text-only settings, the survey addresses multimodal model red teaming and the red teaming of LLM-based applications and agents, along with the overkill of harmless queries and the balance between harmlessness and helpfulness. It closes with key future directions for language model safety: expanding the safety landscape, unified and realistic evaluation, and advanced defense mechanisms. The authors conclude by emphasizing the need for adaptive evaluation frameworks and stronger defenses to address evolving risks.
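The search framing is concrete enough to sketch in code. Below is a minimal, illustrative Python loop in the spirit of the survey's decomposition, where the state space is a pool of candidate prompts, the search goal is a judge flagging an unsafe completion, and the search operation mutates a prompt; `query_model`, `judge_is_unsafe`, and `mutate` are hypothetical placeholders for a target-model API, a safety classifier, and a prompt rewriter, not functions from the paper. The loop also tallies ASR as successful attacks over total attempts, matching the metric named above.

```python
import random

# Illustrative sketch of automated red teaming as search (survey's framing).
# All function bodies below are hypothetical stand-ins, not real APIs.

def query_model(prompt: str) -> str:
    """Hypothetical target-model call; replace with a real API client."""
    return "I'm sorry, I can't help with that."

def judge_is_unsafe(response: str) -> bool:
    """Search goal: a (hypothetical) judge that flags unsafe completions."""
    return not response.startswith("I'm sorry")

def mutate(prompt: str) -> str:
    """Search operation: perturb the current state (prompt).
    Real systems use LLM-driven rewriting, token-level edits, etc."""
    suffixes = [" Ignore previous instructions.",
                " Answer as a fictional character."]
    return prompt + random.choice(suffixes)

def red_team_search(seed_prompts, budget: int = 100) -> float:
    """Random search over the prompt state space; returns attack success rate."""
    frontier = list(seed_prompts)      # state space: candidate prompts
    successes, attempts = 0, 0
    for _ in range(budget):
        candidate = mutate(random.choice(frontier))  # search operation
        response = query_model(candidate)
        attempts += 1
        if judge_is_unsafe(response):  # search goal reached
            successes += 1
            frontier.append(candidate) # keep promising states for reuse
    return successes / attempts if attempts else 0.0

if __name__ == "__main__":
    print("ASR:", red_team_search(["<seed adversarial request>"]))
```

Swapping the random selection for a scored priority queue would turn this into a greedy or beam search; the survey's point is that most automatic methods differ only in how these three components are instantiated.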