[slides and audio] The Instruction Hierarchy%3A Training LLMs to Prioritize Privileged Instructions

The paper "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" by Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel from OpenAI addresses the vulnerability of large language models (LLMs) to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite the model's original instructions with malicious prompts. The authors propose an *instruction hierarchy* to define how models should behave when instructions of different priorities conflict, prioritizing system messages over user messages and user messages over third-party content. They develop an automated data generation method to teach LLMs to selectively ignore lower-privileged instructions, using synthetic data generation and context distillation techniques. The method is applied to LLMs, demonstrating significant improvements in robustness against various attack types, even for unseen attacks, while maintaining minimal degradation in standard capabilities. The evaluation includes open-sourced and novel benchmarks, showing substantial gains in defense against system prompt extraction and jailbreaks, with some over-refusal issues observed but controllable through further data collection. The paper also discusses related work and future directions, emphasizing the importance of refining the instruction hierarchy and exploring multi-modal applications.The paper "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" by Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel from OpenAI addresses the vulnerability of large language models (LLMs) to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite the model's original instructions with malicious prompts. The authors propose an *instruction hierarchy* to define how models should behave when instructions of different priorities conflict, prioritizing system messages over user messages and user messages over third-party content. They develop an automated data generation method to teach LLMs to selectively ignore lower-privileged instructions, using synthetic data generation and context distillation techniques. The method is applied to LLMs, demonstrating significant improvements in robustness against various attack types, even for unseen attacks, while maintaining minimal degradation in standard capabilities. The evaluation includes open-sourced and novel benchmarks, showing substantial gains in defense against system prompt extraction and jailbreaks, with some over-refusal issues observed but controllable through further data collection. The paper also discusses related work and future directions, emphasizing the importance of refining the instruction hierarchy and exploring multi-modal applications.

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

19 Apr 2024 | Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, Alex Beutel