[slides] SoFA%3A Shielded On-the-fly Alignment via Priority Rule Following

The paper introduces a novel alignment paradigm called *priority distillation* to address the challenge of aligning Large Language Models (LLMs) with a broad spectrum of human values and preferences. The paradigm defines *rules* as the primary control mechanism in each dialog, prioritizing them over user instructions. The authors find that even advanced LLMs like GPT-4 struggle with understanding and prioritizing these rules. To enhance the LLMs' ability to integrate and maintain rules, they propose PRIORITYDISTILL, a semi-automated process that distills priority-following signals from LLM simulations. This process involves harvesting diverse rules and instructions, simulating priority-following scenarios, and distilling the learned signals into the model's parameters. Experiments show that PRIORITYDISTILL effectively minimizes misalignments using a single general rule and adapts smoothly to various unseen rules, ensuring robust rule integration and adherence. The contributions of the paper include introducing the priority distillation paradigm, proposing PRIORITYDISTILL, and identifying a benchmark dataset for evaluating on-the-fly alignment capabilities. The paper also discusses limitations and ethical considerations, emphasizing the importance of preventing potential misuse of on-the-fly aligned LLMs.The paper introduces a novel alignment paradigm called *priority distillation* to address the challenge of aligning Large Language Models (LLMs) with a broad spectrum of human values and preferences. The paradigm defines *rules* as the primary control mechanism in each dialog, prioritizing them over user instructions. The authors find that even advanced LLMs like GPT-4 struggle with understanding and prioritizing these rules. To enhance the LLMs' ability to integrate and maintain rules, they propose PRIORITYDISTILL, a semi-automated process that distills priority-following signals from LLM simulations. This process involves harvesting diverse rules and instructions, simulating priority-following scenarios, and distilling the learned signals into the model's parameters. Experiments show that PRIORITYDISTILL effectively minimizes misalignments using a single general rule and adapts smoothly to various unseen rules, ensuring robust rule integration and adherence. The contributions of the paper include introducing the priority distillation paradigm, proposing PRIORITYDISTILL, and identifying a benchmark dataset for evaluating on-the-fly alignment capabilities. The paper also discusses limitations and ethical considerations, emphasizing the importance of preventing potential misuse of on-the-fly aligned LLMs.

SoFA: Shielded On-the-fly Alignment via Priority Rule Following

27 Feb 2024 | Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li