A Critical Evaluation of AI Feedback for Aligning Large Language Models

2024 | Archit Sharma¹, Sedrick Keh², Eric Mitchell¹, Chelsea Finn¹, Kushal Arora², Thomas Kollar²
This paper critically evaluates the effectiveness of Reinforcement Learning with AI Feedback (RLAIF) for improving the instruction-following abilities of large language models (LLMs). The study challenges the assumption that the complex RL step in RLAIF is necessary for effective alignment. The authors show that the improvements from the RL step are largely due to the use of a weaker teacher model (e.g., GPT-3.5) for supervised fine-tuning (SFT) than the critic model (e.g., GPT-4) used for AI feedback generation. They demonstrate that simple SFT with GPT-4 as the teacher outperforms existing RLAIF pipelines. The study also finds that the effectiveness of RLAIF varies across base model families, test-time evaluation protocols, and critic models. The authors provide a mechanistic explanation for when SFT may outperform RLAIF and suggest ways to make RLAIF more effective in practice. They also caution that current evaluation methods may overstate the significance of improvements in instruction-following for open-source LLMs. The paper concludes that while RLAIF can be effective, its performance depends heavily on the quality of the SFT data and the alignment between the teacher and critic models. The authors recommend building instruction-tuning datasets with more recent, more capable LLMs and suggest that future research focus on improving the effectiveness of AI feedback for training instruction-following models.
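To make the contrast concrete, below is a minimal, illustrative sketch of the two recipes being compared: the usual RLAIF pipeline (SFT on a weaker teacher, then preference optimization against feedback from a stronger critic) versus plain SFT on the stronger teacher. The function names, model identifiers, and data structures are hypothetical placeholders, not the authors' code or any particular library's API; the stubs only record which steps run.

```python
# Illustrative sketch (not the authors' implementation) of the two alignment
# pipelines the paper compares. Every function here is a placeholder that only
# records which step ran; a real pipeline would call an actual training stack.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Policy:
    """Stand-in for an instruction-tuned model checkpoint."""
    base: str
    history: List[str] = field(default_factory=list)


def supervised_finetune(base: str, teacher: str) -> Policy:
    """SFT: imitate completions written by `teacher` on an instruction set."""
    policy = Policy(base=base)
    policy.history.append(f"SFT on completions from {teacher}")
    return policy


def collect_ai_preferences(policy: Policy, critic: str) -> str:
    """Have `critic` rank pairs of completions sampled from `policy`."""
    return f"preference pairs labeled by {critic}"


def preference_optimize(policy: Policy, preferences: str) -> Policy:
    """RL / preference-optimization step (e.g. PPO- or DPO-style) on AI feedback."""
    policy.history.append(f"preference optimization on {preferences}")
    return policy


def rlaif_pipeline(base: str, teacher: str, critic: str) -> Policy:
    """Typical RLAIF recipe: SFT on a (weaker) teacher, then RL on critic feedback."""
    policy = supervised_finetune(base, teacher)
    prefs = collect_ai_preferences(policy, critic)
    return preference_optimize(policy, prefs)


def sft_only_pipeline(base: str, teacher: str) -> Policy:
    """The paper's simpler baseline: SFT directly on the stronger model's completions."""
    return supervised_finetune(base, teacher)


if __name__ == "__main__":
    # The paper's observation, in caricature: much of RLAIF's apparent gain comes
    # from the teacher/critic gap (GPT-3.5 SFT data vs. GPT-4 feedback), and simply
    # distilling from the stronger model via SFT can match or beat the RL step.
    rlaif = rlaif_pipeline(base="open-llm", teacher="gpt-3.5", critic="gpt-4")
    sft = sft_only_pipeline(base="open-llm", teacher="gpt-4")
    print("RLAIF:", rlaif.history)
    print("SFT-only:", sft.history)
```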