29 Feb 2024 | Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Ramé, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel
The paper introduces Online AI Feedback (OAIF), a method that enhances Direct Alignment from Preferences (DAP) methods by incorporating online AI feedback. DAP methods, such as Direct Preference Optimization (DPO), Sequence Likelihood Calibration with Human Feedback (SLiC), and Identity Policy Optimization (IPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) that do not require a separate reward model. However, these methods typically train on offline preference datasets, which leads to distribution shift and off-policy learning issues.
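The three DAP losses named above all operate on the same reference-normalized log-probability margin between the preferred and rejected response. A common unified presentation is sketched below; the notation is mine, and the paper's exact parameterization may differ slightly:

$$
\rho_\theta(x, y^+, y^-) = \log\frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \log\frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)},
$$

$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\big(\beta\,\rho_\theta\big), \qquad
\mathcal{L}_{\mathrm{IPO}} = \Big(\rho_\theta - \tfrac{1}{2\beta}\Big)^2, \qquad
\mathcal{L}_{\mathrm{SLiC}} = \max\!\big(0,\; 1 - \beta\,\rho_\theta\big),
$$

where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $(y^+, y^-)$ are the preferred and rejected responses for prompt $x$, and $\beta$ controls the regularization strength.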
OAIF addresses these problems by using an LLM as an annotator to provide online feedback. At each training step, two responses are sampled from the current model, and the annotating LLM is prompted to choose the preferred one. This feedback is then used to update the model through standard DAP losses. Human evaluations show that the method outperforms both offline DAP and RLHF baselines across several tasks.
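The per-iteration logic is simple enough to sketch. The snippet below is a minimal illustration, not the paper's implementation: `sample`, `logprob_policy`, `logprob_ref`, and `annotate` are assumed interfaces standing in for the policy being trained, a frozen reference model, and the prompted LLM annotator, and the DPO loss is used as the DAP loss for concreteness.

```python
import math
import random
from typing import Callable

def dpo_loss(policy_lp_pos: float, policy_lp_neg: float,
             ref_lp_pos: float, ref_lp_neg: float, beta: float = 0.1) -> float:
    """Pairwise DPO loss for one (preferred, rejected) pair, given sequence log-probs."""
    margin = (policy_lp_pos - ref_lp_pos) - (policy_lp_neg - ref_lp_neg)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid(beta * margin)

def oaif_step(prompt: str,
              sample: Callable[[str], str],
              logprob_policy: Callable[[str, str], float],
              logprob_ref: Callable[[str, str], float],
              annotate: Callable[[str, str, str], int],
              beta: float = 0.1) -> float:
    """One OAIF iteration: sample two on-policy responses, ask the LLM annotator
    which is preferred, and score the pair with a DAP loss (DPO here)."""
    y1, y2 = sample(prompt), sample(prompt)           # on-policy generations
    preferred = annotate(prompt, y1, y2)              # online AI feedback: 0 or 1
    y_pos, y_neg = (y1, y2) if preferred == 0 else (y2, y1)
    loss = dpo_loss(logprob_policy(prompt, y_pos), logprob_policy(prompt, y_neg),
                    logprob_ref(prompt, y_pos), logprob_ref(prompt, y_neg), beta)
    # In practice the log-probs are differentiable tensors and the policy parameters
    # are updated by backpropagating this loss; here we just return its value.
    return loss

# Toy stand-ins to show the call pattern; real use plugs in an LLM policy,
# a frozen reference copy of it, and a prompted LLM annotator.
toy_sample = lambda p: random.choice(["a short answer", "a much longer, rambling answer"])
toy_logprob = lambda p, y: -0.1 * len(y)              # fake sequence log-probability
toy_annotate = lambda p, a, b: 0 if len(a) <= len(b) else 1
print(oaif_step("Explain OAIF in one sentence.", toy_sample, toy_logprob, toy_logprob, toy_annotate))
```

The key contrast with offline DAP is that the pair `(y_pos, y_neg)` is freshly sampled from the current model at every step rather than drawn from a fixed dataset.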
Key contributions of OAIF include:
1. **Effectiveness**: OAIF significantly improves the performance of DAP methods compared to their offline counterparts.
2. **Generality**: The method is compatible with various DAP loss functions, including DPO, IPO, and SLiC.
3. **On-policy Learning**: OAIF achieves on-policy learning by sampling responses from the current model, ensuring that the model receives feedback on its own generations.
4. **Controllability**: The feedback can be easily controlled by modifying prompts to the LLM annotator, allowing for adjustments to response length and other qualitative desiderata.
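To illustrate the controllability point (item 4 above): because preferences come from a prompted LLM annotator rather than a fixed dataset, changing the judging prompt changes the training signal. The template below is hypothetical wording, not the paper's exact prompt; the paper uses this kind of prompt modification to, for example, steer responses toward shorter lengths.

```python
# Illustrative annotator prompt template (hypothetical wording, not the paper's exact prompt).
ANNOTATOR_PROMPT = """You are judging two candidate responses to a user prompt.

User prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

{judging_instruction}
Reply with "A" or "B" only."""

# Swapping a single instruction line changes the training signal, e.g. to control length:
default_instruction = "Pick the more helpful and accurate response."
length_controlled   = "Pick the more helpful and accurate response, preferring the shorter one."

judge_input = ANNOTATOR_PROMPT.format(
    prompt="Summarize the OAIF paper.",
    response_a="...",
    response_b="...",
    judging_instruction=length_controlled,
)
```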
The paper also discusses the limitations of OAIF, such as the impact of annotator LLM size on performance and the need for further research on out-of-distribution prompts and on scaling up the method. Overall, OAIF provides a scalable and effective approach to aligning large language models with human values, reducing the need for extensive human annotation.