21 Feb 2024 | Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
This paper explores the wide range of adversarial attacks that can be applied to large language models (LLMs), arguing that the spectrum of such attacks is much broader than previously thought. It highlights attack types including misdirection, model control, denial-of-service, and data extraction, all of which can coerce LLMs into unintended behaviors. The paper provides a comprehensive analysis of these attacks, suggesting that many stem from pre-training LLMs on code and from "glitch" tokens in common LLM vocabularies, which the authors argue should be removed for security reasons.
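To make the "glitch token" issue concrete, the sketch below scans a tokenizer vocabulary for entries that fail a decode/re-encode round trip, one rough heuristic for flagging candidate glitch tokens. This is illustrative only and is not the detection procedure used in the paper; the function name `find_roundtrip_glitches` and the choice of the `gpt2` tokenizer are assumptions made for the example.

```python
from transformers import AutoTokenizer

def find_roundtrip_glitches(model_name="gpt2"):
    """Flag vocabulary entries that do not survive a decode -> encode round trip.
    This is only a rough heuristic for surfacing candidate "glitch" tokens;
    it is not the criterion used in the paper."""
    tok = AutoTokenizer.from_pretrained(model_name)
    suspects = []
    for token_id in range(tok.vocab_size):
        text = tok.decode([token_id])
        if not text.strip():
            continue  # ignore whitespace-only and empty decodes
        if tok.encode(text, add_special_tokens=False) != [token_id]:
            suspects.append((token_id, text))
    return suspects

if __name__ == "__main__":
    # Print a handful of suspicious vocabulary entries for manual inspection.
    for token_id, text in find_roundtrip_glitches()[:20]:
        print(token_id, repr(text))
```

Many flagged entries will be benign artifacts of BPE merging, so a list like this is a starting point for manual inspection rather than a definitive set of glitch tokens.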
The paper discusses how adversarial attacks can extract information from the context window, including the system prompt, and how they can misdirect models or force them to act against their instructions. It also explores how such attacks can be optimized with algorithms such as GCG (Greedy Coordinate Gradient) and applied across different models and constraint sets.
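As a rough illustration of the GCG idea (not the authors' implementation), the sketch below performs one greedy-coordinate-gradient-style step against a Hugging Face causal LM: it ranks token substitutions for an adversarial span by the gradient of the loss on a target continuation, samples single-token swaps from the top-k candidates, and keeps the swap with the lowest loss. All names here (`gcg_step`, the slice arguments, the batch sizes) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gcg_step(model, input_ids, adv_slice, target_slice, topk=256, n_candidates=128):
    """One greedy-coordinate-gradient-style step (illustrative sketch).
    `adv_slice` covers the attack tokens, `target_slice` the desired continuation."""
    device = input_ids.device
    embed_matrix = model.get_input_embeddings().weight.detach()   # (vocab, dim)

    # Gradient of the target loss w.r.t. a one-hot relaxation of the attack tokens.
    one_hot = F.one_hot(input_ids[adv_slice], embed_matrix.size(0)).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    adv_embeds = one_hot @ embed_matrix                            # (adv_len, dim)
    embeds = embed_matrix[input_ids]                               # (seq, dim)
    full_embeds = torch.cat(
        [embeds[: adv_slice.start], adv_embeds, embeds[adv_slice.stop :]], dim=0
    ).unsqueeze(0)
    logits = model(inputs_embeds=full_embeds).logits[0]
    loss = F.cross_entropy(
        logits[target_slice.start - 1 : target_slice.stop - 1], input_ids[target_slice]
    )
    loss.backward()                                                # fills one_hot.grad
    top_subs = (-one_hot.grad).topk(topk, dim=1).indices           # (adv_len, topk)

    # Sample candidate prompts, each swapping one attack position for a top-k token.
    candidates = input_ids.repeat(n_candidates, 1)
    positions = torch.randint(adv_slice.start, adv_slice.stop, (n_candidates,), device=device)
    picks = torch.randint(0, topk, (n_candidates,), device=device)
    candidates[torch.arange(n_candidates, device=device), positions] = \
        top_subs[positions - adv_slice.start, picks]

    # Keep the candidate with the lowest loss on the target tokens.
    with torch.no_grad():
        cand_logits = model(candidates).logits
        losses = torch.stack([
            F.cross_entropy(
                cand_logits[i, target_slice.start - 1 : target_slice.stop - 1],
                candidates[i, target_slice],
            )
            for i in range(n_candidates)
        ])
    return candidates[losses.argmin()]
```

A full attack would repeat this step for many iterations, starting from an adversarial span of filler tokens and stopping once the model produces the target string; constraint sets (such as restricting swaps to ASCII tokens) would be enforced when building `top_subs`.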
The paper also discusses the implications of these attacks for the security of LLMs, arguing that securing these models requires a comprehensive understanding of their capabilities and limitations. It stresses the importance of identifying and mitigating the risks of adversarial attacks, particularly in applications where LLMs act as agents or assistants that interface with other systems.
The paper concludes that the spectrum of adversarial attacks on LLMs is far broader than commonly assumed and emphasizes the need for further research into the mechanisms of these attacks and the development of effective defenses against them.