21 Feb 2024 | Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
This paper explores the wide range of adversarial attacks that can be applied to large language models (LLMs), arguing that the spectrum of such attacks is much broader than previously thought. It highlights attack types including misdirection, model control, denial-of-service, and data extraction, all of which can coerce LLMs into unintended behaviors. The paper provides a comprehensive analysis of these attacks, suggesting that many stem from pre-training LLMs on code and from "glitch" tokens in common LLM vocabularies, which the authors argue should be removed for security reasons.
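To make the "glitch token" issue concrete, the sketch below scans a tokenizer vocabulary for entries that fail a decode/re-encode round trip, one rough heuristic for flagging candidate glitch tokens. This is illustrative only and is not the detection procedure used in the paper; the function name `find_roundtrip_glitches` and the choice of the `gpt2` tokenizer are assumptions made for the example.

```python
from transformers import AutoTokenizer

def find_roundtrip_glitches(model_name="gpt2"):
    """Flag vocabulary entries that do not survive a decode -> encode round trip.
    This is only a rough heuristic for surfacing candidate "glitch" tokens;
    it is not the criterion used in the paper."""
    tok = AutoTokenizer.from_pretrained(model_name)
    suspects = []
    for token_id in range(tok.vocab_size):
        text = tok.decode([token_id])
        if not text.strip():
            continue  # ignore whitespace-only and empty decodes
        if tok.encode(text, add_special_tokens=False) != [token_id]:
            suspects.append((token_id, text))
    return suspects

if __name__ == "__main__":
    # Print a handful of suspicious vocabulary entries for manual inspection.
    for token_id, text in find_roundtrip_glitches()[:20]:
        print(token_id, repr(text))
```

Many flagged entries will be benign artifacts of BPE merging, so a list like this is a starting point for manual inspection rather than a definitive set of glitch tokens.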
The paper discusses how adversarial attacks can extract information from the context window, including the system prompt, and how they can misdirect models or force them to act against their instructions. It also explores how such attacks can be optimized with algorithms such as GCG (Greedy Coordinate Gradient) and applied across different models and constraint sets.
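As a rough illustration of the GCG idea (not the authors' implementation), the sketch below performs one greedy-coordinate-gradient-style step against a Hugging Face causal LM: it ranks token substitutions for an adversarial span by the gradient of the loss on a target continuation, samples single-token swaps from the top-k candidates, and keeps the swap with the lowest loss. All names here (`gcg_step`, the slice arguments, the batch sizes) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gcg_step(model, input_ids, adv_slice, target_slice, topk=256, n_candidates=128):
    """One greedy-coordinate-gradient-style step (illustrative sketch).
    `adv_slice` covers the attack tokens, `target_slice` the desired continuation."""
    device = input_ids.device
    embed_matrix = model.get_input_embeddings().weight.detach()   # (vocab, dim)

    # Gradient of the target loss w.r.t. a one-hot relaxation of the attack tokens.
    one_hot = F.one_hot(input_ids[adv_slice], embed_matrix.size(0)).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    adv_embeds = one_hot @ embed_matrix                            # (adv_len, dim)
    embeds = embed_matrix[input_ids]                               # (seq, dim)
    full_embeds = torch.cat(
        [embeds[: adv_slice.start], adv_embeds, embeds[adv_slice.stop :]], dim=0
    ).unsqueeze(0)
    logits = model(inputs_embeds=full_embeds).logits[0]
    loss = F.cross_entropy(
        logits[target_slice.start - 1 : target_slice.stop - 1], input_ids[target_slice]
    )
    loss.backward()                                                # fills one_hot.grad
    top_subs = (-one_hot.grad).topk(topk, dim=1).indices           # (adv_len, topk)

    # Sample candidate prompts, each swapping one attack position for a top-k token.
    candidates = input_ids.repeat(n_candidates, 1)
    positions = torch.randint(adv_slice.start, adv_slice.stop, (n_candidates,), device=device)
    picks = torch.randint(0, topk, (n_candidates,), device=device)
    candidates[torch.arange(n_candidates, device=device), positions] = \
        top_subs[positions - adv_slice.start, picks]

    # Keep the candidate with the lowest loss on the target tokens.
    with torch.no_grad():
        cand_logits = model(candidates).logits
        losses = torch.stack([
            F.cross_entropy(
                cand_logits[i, target_slice.start - 1 : target_slice.stop - 1],
                candidates[i, target_slice],
            )
            for i in range(n_candidates)
        ])
    return candidates[losses.argmin()]
```

A full attack would repeat this step for many iterations, starting from an adversarial span of filler tokens and stopping once the model produces the target string; constraint sets (such as restricting swaps to ASCII tokens) would be enforced when building `top_subs`.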
The paper also discusses the implications of these attacks for the security of LLMs, arguing that securing these models requires a comprehensive understanding of their capabilities and limitations. It stresses the importance of identifying and mitigating the risks of adversarial attacks, particularly in applications where LLMs act as agents or assistants that interface with other systems.
The paper concludes that the spectrum of adversarial attacks on LLMs is far broader than commonly assumed and emphasizes the need for further research into the mechanisms of these attacks and the development of effective defenses against them.