19 Apr 2024 | Eric Wallace*, Kai Xiao*, Reimar Leike*, Lilian Weng, Johannes Heidecke, Alex Beutel
This paper introduces an instruction hierarchy to improve the robustness of large language models (LLMs) against attacks such as prompt injections, jailbreaks, and system prompt extractions. The key idea is to prioritize system messages over user and third-party inputs, ensuring that LLMs follow higher-priority instructions even when they conflict with lower-priority ones. The authors propose an automated data generation method to train LLMs to selectively ignore lower-privileged instructions, which significantly enhances the model's robustness without degrading its standard capabilities.
The paper describes various types of attacks on LLMs, including prompt injections, jailbreaks, and system message extractions. These attacks exploit the lack of a clear instruction hierarchy in LLMs, allowing adversaries to override higher-level instructions. To address this, the authors propose a framework where system messages are given higher priority, and models are trained to ignore lower-priority instructions when they conflict with higher-priority ones.
The authors generate training data using two main approaches: context synthesis for aligned instructions and context ignorance for misaligned instructions. They use synthetic data generation and context distillation to create examples that teach LLMs to follow higher-priority instructions while ignoring lower-priority ones. The results show that the instruction hierarchy significantly improves the model's robustness against various attacks, including system prompt extraction and jailbreaks, while maintaining its ability to perform standard tasks.
The paper also discusses the challenges of over-refusal, where models may refuse to follow benign instructions. The authors find that while some regressions occur, the overall performance of the models remains strong, and further data collection can help improve the refusal decision boundary.
In conclusion, the instruction hierarchy provides a framework for teaching LLMs to follow instructions while ignoring adversarial manipulation. The approach has shown promising results in improving the robustness of LLMs against various attacks, making them safer and more controllable. Future work includes refining the handling of conflicting instructions, expanding to other modalities, and exploring model architecture changes to better instill the instruction hierarchy.This paper introduces an instruction hierarchy to improve the robustness of large language models (LLMs) against attacks such as prompt injections, jailbreaks, and system prompt extractions. The key idea is to prioritize system messages over user and third-party inputs, ensuring that LLMs follow higher-priority instructions even when they conflict with lower-priority ones. The authors propose an automated data generation method to train LLMs to selectively ignore lower-privileged instructions, which significantly enhances the model's robustness without degrading its standard capabilities.
The paper describes various types of attacks on LLMs, including prompt injections, jailbreaks, and system message extractions. These attacks exploit the lack of a clear instruction hierarchy in LLMs, allowing adversaries to override higher-level instructions. To address this, the authors propose a framework where system messages are given higher priority, and models are trained to ignore lower-priority instructions when they conflict with higher-priority ones.
The authors generate training data using two main approaches: context synthesis for aligned instructions and context ignorance for misaligned instructions. They use synthetic data generation and context distillation to create examples that teach LLMs to follow higher-priority instructions while ignoring lower-priority ones. The results show that the instruction hierarchy significantly improves the model's robustness against various attacks, including system prompt extraction and jailbreaks, while maintaining its ability to perform standard tasks.
The paper also discusses the challenges of over-refusal, where models may refuse to follow benign instructions. The authors find that while some regressions occur, the overall performance of the models remains strong, and further data collection can help improve the refusal decision boundary.
In conclusion, the instruction hierarchy provides a framework for teaching LLMs to follow instructions while ignoring adversarial manipulation. The approach has shown promising results in improving the robustness of LLMs against various attacks, making them safer and more controllable. Future work includes refining the handling of conflicting instructions, expanding to other modalities, and exploring model architecture changes to better instill the instruction hierarchy.