28 Jun 2024 | Nat McAleese, Rai (Michael Pokorny), Juan Felipe Cerón Uribe, Evgenia Nitishinskaya, Maja Trębacz, Jan Leike
This paper addresses the limitations of human evaluation in assessing the output of large language models (LLMs) trained with reinforcement learning from human feedback (RLHF). To improve this process, the authors train "critic" models, themselves LLMs, to provide natural language feedback on code written by other LLMs. The critics are trained with RLHF to highlight problems in code drawn from real-world assistant tasks. Model-written critiques are preferred over human critiques in 63% of cases, and the models catch more inserted bugs than human contractors paid specifically for code review. The authors also study human-machine teams, finding that they produce critiques more comprehensive than those written by humans alone while hallucinating and nitpicking less than the model alone.

The paper introduces Force Sampling Beam Search (FSBS), a sampling strategy that balances the tradeoff between the number of real and spurious issues raised in LLM critiques. The critics are evaluated on two datasets: Human Inserted Bugs, where contractors deliberately insert subtle bugs into otherwise acceptable code, and Human Detected Bugs, consisting of naturally occurring bugs previously caught by human reviewers. Across both, LLM critics catch more bugs and improve the comprehensiveness of critiques, even on tasks outside the distribution of model errors. The paper concludes by discussing limitations and future directions, emphasizing the need for scalable oversight methods so that humans can continue to reward the right behaviors as AI systems become smarter than the people overseeing them.
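The summary above only names FSBS; as a rough illustration of the kind of selection rule such a strategy implies, the sketch below generates candidate critiques that are forced to quote (highlight) increasingly many code snippets and then ranks them by a reward-model score plus a per-highlight bonus, so that a larger bonus yields more comprehensive critiques at the cost of more spurious issues. This is a hedged sketch under assumptions, not the paper's implementation; the helpers `reward_model_score` and `sample_candidates` and the parameter `length_modifier` are hypothetical placeholders.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Critique:
    text: str
    # Code snippets the critique quotes (highlights) as problematic.
    highlights: list = field(default_factory=list)

def reward_model_score(critique: Critique) -> float:
    """Placeholder for a learned reward model rating critique quality."""
    return random.random()  # stand-in; a real critic RM would score the text

def sample_candidates(code: str, num_forced_highlights: int, n: int) -> list:
    """Placeholder for constrained sampling that forces the critic to quote
    `num_forced_highlights` distinct snippets of `code` in its critique."""
    return [
        Critique(
            text=f"critique quoting {num_forced_highlights} snippet(s)",
            highlights=["<snippet>"] * num_forced_highlights,
        )
        for _ in range(n)
    ]

def fsbs_style_select(code: str, length_modifier: float,
                      max_highlights: int = 4, per_depth: int = 4) -> Critique:
    """Sketch of an FSBS-like search: explore critiques with increasingly many
    forced highlights and keep the one maximizing reward-model score plus a
    per-highlight bonus. Raising `length_modifier` trades precision
    (fewer spurious issues) for comprehensiveness (more real bugs surfaced)."""
    candidates = []
    for k in range(1, max_highlights + 1):
        candidates.extend(sample_candidates(code, k, per_depth))
    return max(
        candidates,
        key=lambda c: reward_model_score(c) + length_modifier * len(c.highlights),
    )

# Usage: a conservative setting (no bonus) versus a more thorough one.
precise = fsbs_style_select("def add(a, b): return a - b", length_modifier=0.0)
thorough = fsbs_style_select("def add(a, b): return a - b", length_modifier=0.5)
```

The design point the sketch is meant to convey is that the comprehensiveness/hallucination tradeoff is exposed as a single scalar at inference time, rather than being fixed by training.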