A Critical Evaluation of AI Feedback for Aligning Large Language Models

19 Feb 2024 | Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar
This paper critically evaluates the effectiveness of Reinforcement Learning with AI Feedback (RLAIF) in improving the instruction-following abilities of large language models (LLMs). RLAIF involves two stages: supervised fine-tuning (SFT) using demonstrations from a teacher model, followed by reinforcement learning (RL) using feedback from a critic model. While recent open-source models have shown significant improvements from the RL step, the paper questions whether this complexity is truly warranted. It finds that the improvements from the RL step are largely due to using a weaker teacher model (e.g., GPT-3.5) for SFT data collection compared to the stronger critic model (e.g., GPT-4). Specifically, simple SFT with GPT-4 as the teacher outperforms existing RLAIF pipelines. The paper also finds that the gains from RLAIF vary significantly across different base model families, test-time evaluation protocols, and critic models. It provides a mechanistic explanation for when SFT may outperform the full RLAIF pipeline and offers suggestions for making RLAIF more effective in practice. The code for the experiments is available at <https://github.com/architsharma97/dpo-rlaif>.
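The RL-from-AI-feedback step in pipelines like this is commonly implemented with Direct Preference Optimization (DPO), which the name of the linked repository suggests is the variant studied here. Below is a minimal, illustrative sketch of the DPO loss applied to critic-labeled preference pairs; the function name, tensor shapes, and beta value are assumptions for illustration, not code taken from the paper's repository.

```python
# Minimal sketch of a DPO-style preference loss for the RL-with-AI-feedback
# stage. Names, shapes, and the beta value are illustrative assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Compute the DPO objective from summed per-sequence log-probabilities.

    Each tensor has shape (batch,) and holds log pi(y | x) summed over the
    response tokens; "chosen"/"rejected" refer to the critic model's
    (e.g., GPT-4's) preference between two sampled completions.
    """
    # Log-ratio of the policy to the frozen SFT reference for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry-style objective: push the preferred completion's implicit
    # reward above the dispreferred one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities (batch of 4 preference pairs).
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```

The key design point this sketch captures is that the preference stage optimizes the policy against a frozen SFT reference, which is why the quality of the SFT teacher (GPT-3.5 vs. GPT-4) so strongly conditions how much the RL step appears to help.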