Direct Language Model Alignment from Online AI Feedback

29 Feb 2024 | Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel
This paper introduces Online AI Feedback (OAIF), a method that makes direct alignment from preferences (DAP) methods online and on-policy. OAIF uses an LLM as an annotator to provide online feedback during training: at each step, two responses are sampled from the current model, the LLM annotator is prompted to choose the preferred one, and this preference is used to update the model. OAIF can be applied to various DAP methods, including DPO, IPO, and SLiC, and experiments show that it outperforms both their offline counterparts and RLHF methods on several tasks. The feedback is also easily controllable via the instruction prompts given to the LLM annotator, so training can be steered toward desired outcomes. The paper also discusses the limitations of OAIF, including potential challenges from distribution shift and the need for further research on the scalability of the method. Overall, OAIF provides a simple and effective way to make DAP methods online and on-policy, improving the alignment of large language models with human values.
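The online procedure can be sketched as follows. The snippet below is a minimal PyTorch illustration of one OAIF step instantiated with a DPO-style loss; `sample_two_responses`, `annotator_prefers`, and `log_prob` are assumed helper functions (not from the paper) wrapping generation by the current policy, preference annotation by the LLM annotator, and sequence log-probability scoring.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss from summed log-probs of the preferred (w) and
    dispreferred (l) responses under the policy and a frozen reference."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

def oaif_step(policy, ref, annotator, prompts, optimizer, beta=0.1):
    """One online step: sample from the *current* policy, ask the LLM
    annotator which response it prefers, then apply the DAP (here: DPO) update.
    The helpers below are assumed, illustrative wrappers."""
    y1, y2 = sample_two_responses(policy, prompts)                 # on-policy samples
    prefer_first = annotator_prefers(annotator, prompts, y1, y2)   # online AI feedback
    y_w = [a if p else b for a, b, p in zip(y1, y2, prefer_first)]
    y_l = [b if p else a for a, b, p in zip(y1, y2, prefer_first)]

    loss = dpo_loss(
        log_prob(policy, prompts, y_w), log_prob(policy, prompts, y_l),
        log_prob(ref, prompts, y_w).detach(), log_prob(ref, prompts, y_l).detach(),
        beta=beta,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Swapping `dpo_loss` for an IPO- or SLiC-style pairwise loss leaves the rest of the loop unchanged, which is what makes OAIF applicable across DAP methods.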