SoFA: Shielded On-the-fly Alignment via Priority Rule Following

SoFA: Shielded On-the-fly Alignment via Priority Rule Following

27 Feb 2024 | Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li
SoFA: Shielded On-the-fly Alignment via Priority Rule Following introduces a novel alignment paradigm for Large Language Models (LLMs) called priority rule following. This approach defines rules as the primary control mechanism in each dialog, prioritizing them over user instructions. The paper presents PRIORITYDISTILL, a semi-automated method for distilling priority following signals from LLM simulations to ensure robust rule integration and adherence. The method effectively minimizes misalignments using a single general rule and adapts smoothly to various unseen rules, ensuring they are shielded from hijacking and that the model responds appropriately. The alignment problem in LLMs involves adapting them to the broad spectrum of human values, which challenges existing alignment methods due to the diversity of preferences and regulatory standards. Current alignment technologies, such as Reinforcement Learning from Human Feedback (RLHF), rely on annotating preference data, making them time-consuming and expensive. On the other hand, controlling instructions like "You are a helpful assistant" can lead to conflicts between unclear boundaries and complex relationships between regular instructions and controlling ones, resulting in confused and hijacked model responses. The paper proposes a mechanism that enables LLMs to clearly distinguish controlling instructions from other instructions and train these models to better integrate the rules, ensuring that controlling instructions are shielded from hijacking and that the model responds appropriately. The alignment paradigm of priority rule following defines rules as a controlling strategy for each dialogue and prioritizes these rules above all user instructions. The paper also identifies and further annotates a set of benchmarks to examine the model's proficiency in on-the-fly alignment, providing a resource that can benefit future research. The PRIORITYDISTILL process distills priority following signals from LLM simulations, focusing on three main challenges: identifying high-quality (r,i) pairs, obtaining the appropriate response signal y, and effectively learning the (r,i,y) triplets. The simulation process includes three steps: harvesting rules and instructions, automatic probe and constrain generation, and priority distillation. The final step, termed priority distillation, is aimed at getting the corresponding response y that meets the key properties outlined in Section 3. The PRIORITYRULES dataset, generated through the simulation process, contains over 20K rules and 42K corresponding instructions. The dataset is used to train models to follow rules and ensure that the model's alignment process is based on the rules. The paper also introduces a reference signal to prevent the model from directly memorizing instruction-response pairs. The experiments show that the proposed method not only effectively reduces misaligned behaviors using a single general rule but also effectively applies to various unseen rules, rejecting the harmful ones. The paper's contributions include introducing a novel alignment paradigm that trains models to better integrate and maintain rules, proposing PRIORITYDISTILL, a semi-automated process that improves the model’s ability to integrate and maintain rules, and identifying and further annotating a set of benchmarks to examine the model’s proficiency in on-theSoFA: Shielded On-the-fly Alignment via Priority Rule Following introduces a novel alignment paradigm for Large Language Models (LLMs) called priority rule following. This approach defines rules as the primary control mechanism in each dialog, prioritizing them over user instructions. The paper presents PRIORITYDISTILL, a semi-automated method for distilling priority following signals from LLM simulations to ensure robust rule integration and adherence. The method effectively minimizes misalignments using a single general rule and adapts smoothly to various unseen rules, ensuring they are shielded from hijacking and that the model responds appropriately. The alignment problem in LLMs involves adapting them to the broad spectrum of human values, which challenges existing alignment methods due to the diversity of preferences and regulatory standards. Current alignment technologies, such as Reinforcement Learning from Human Feedback (RLHF), rely on annotating preference data, making them time-consuming and expensive. On the other hand, controlling instructions like "You are a helpful assistant" can lead to conflicts between unclear boundaries and complex relationships between regular instructions and controlling ones, resulting in confused and hijacked model responses. The paper proposes a mechanism that enables LLMs to clearly distinguish controlling instructions from other instructions and train these models to better integrate the rules, ensuring that controlling instructions are shielded from hijacking and that the model responds appropriately. The alignment paradigm of priority rule following defines rules as a controlling strategy for each dialogue and prioritizes these rules above all user instructions. The paper also identifies and further annotates a set of benchmarks to examine the model's proficiency in on-the-fly alignment, providing a resource that can benefit future research. The PRIORITYDISTILL process distills priority following signals from LLM simulations, focusing on three main challenges: identifying high-quality (r,i) pairs, obtaining the appropriate response signal y, and effectively learning the (r,i,y) triplets. The simulation process includes three steps: harvesting rules and instructions, automatic probe and constrain generation, and priority distillation. The final step, termed priority distillation, is aimed at getting the corresponding response y that meets the key properties outlined in Section 3. The PRIORITYRULES dataset, generated through the simulation process, contains over 20K rules and 42K corresponding instructions. The dataset is used to train models to follow rules and ensure that the model's alignment process is based on the rules. The paper also introduces a reference signal to prevent the model from directly memorizing instruction-response pairs. The experiments show that the proposed method not only effectively reduces misaligned behaviors using a single general rule but also effectively applies to various unseen rules, rejecting the harmful ones. The paper's contributions include introducing a novel alignment paradigm that trains models to better integrate and maintain rules, proposing PRIORITYDISTILL, a semi-automated process that improves the model’s ability to integrate and maintain rules, and identifying and further annotating a set of benchmarks to examine the model’s proficiency in on-the
Reach us at info@study.space