LLM Critics Help Catch LLM Bugs


28 Jun 2024 | Nat McAleese*, Rai (Michael Pokorny)*, Juan Felipe Cerón Uribe*, Evgenia Nitishinskaya*, Maja Trębacz*, Jan Leike†
This paper presents a method for improving human evaluation of large language model (LLM) outputs by training "critic" models that help humans evaluate model-written code more accurately. The critics are themselves LLMs, trained with reinforcement learning from human feedback (RLHF) to write natural-language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that the models catch more bugs than human contractors paid for code review. The fine-tuned LLM critics also identify hundreds of errors in ChatGPT training data that had been rated as "flawless", even though most of those tasks are not code tasks. The critics have limitations, including hallucinated bugs that could mislead humans, but human-machine teams of critics and contractors catch a similar number of bugs to LLM critics alone while hallucinating less than unassisted models.
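The sketch below illustrates the critic's inference-time interface as described above: an LLM that takes a (question, answer) pair and emits a free-text critique. The prompt wording, the `model.generate` call, and the helper names are hypothetical placeholders, not the paper's actual training setup or API.

```python
# Minimal sketch of the critic interface: the critic is an autoregressive LLM
# that maps a (question, answer) pair to a natural-language critique.
# `model.generate` stands in for whatever LLM call is available; the prompt
# format here is illustrative, not the one used in the paper.

def build_critic_prompt(question: str, answer: str) -> str:
    """Format a (question, answer) pair as a single prompt for the critic."""
    return (
        "You are a careful code reviewer. Point out any bugs or problems "
        "in the answer below.\n\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        "Critique:"
    )

def critique(model, question: str, answer: str) -> str:
    """Ask the critic model for a text critique of the given answer."""
    prompt = build_critic_prompt(question, answer)
    return model.generate(prompt)  # hypothetical generation call
```

In the paper the critic is additionally trained with RLHF on human preferences over critiques; the sketch only shows how the trained policy is queried.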
The core of the approach is to train an autoregressive policy that accepts a (question, answer) pair as input and outputs a text critique pointing out errors in the answer. The resulting model, CriticGPT, outperforms representative humans at challenging bug-detection tasks: LLM critics catch substantially more inserted bugs than qualified humans paid for code review, and model critiques are preferred over human critiques more than 80% of the time. Human-machine teams of contractors assisted by critic models write more comprehensive critiques than contractors alone while hallucinating less than models alone. The paper also investigates the tradeoff between comprehensiveness and hallucination, introducing a sampling and scoring strategy, Force Sampling Beam Search (FSBS), that balances the number of real and spurious issues included in LLM critiques. FSBS lets a good tradeoff for RLHF data collection be chosen at deployment time without retraining the critique model. Finally, the paper discusses limitations of the approach, including the difficulty of detecting bugs in real-world code and the potential for critics to introduce new biases, and concludes that LLM critics are a promising start for scalable oversight, helping humans evaluate model output more accurately.
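To make the comprehensiveness/hallucination tradeoff concrete, here is a hedged sketch of an FSBS-style selection step: sample several candidate critiques, score each with a reward model plus a bonus proportional to how many issues it raises, and keep the best. The helpers `sample_critiques`, `reward_score`, and `count_issues` are hypothetical stand-ins, and the real FSBS additionally constrains sampling so that critiques must quote the code they criticize; only the scoring idea is shown here.

```python
# Hedged sketch of selecting a critique by trading off reward-model score
# against coverage (number of issues raised). All helper callables are
# assumed to be supplied by the caller; none are real library APIs.

from typing import Callable, List

def fsbs_style_select(
    sample_critiques: Callable[[str, str, int], List[str]],  # draws n candidate critiques
    reward_score: Callable[[str, str, str], float],          # reward model score of a critique
    count_issues: Callable[[str], int],                      # number of distinct issues raised
    question: str,
    answer: str,
    n_samples: int = 8,
    length_bonus: float = 0.5,  # larger values favor longer, more comprehensive critiques
) -> str:
    """Return the candidate critique with the best combined score."""
    candidates = sample_critiques(question, answer, n_samples)

    def score(critique: str) -> float:
        return reward_score(question, answer, critique) + length_bonus * count_issues(critique)

    return max(candidates, key=score)
```

Sweeping `length_bonus` at deployment time is what allows the tradeoff between real and spurious issues to be tuned without retraining the critique model, matching the behavior described above.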