On scalable oversight with weak LLMs judging strong LLMs

2024-7-15 | Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, and Rohin Shah
This paper explores the effectiveness of scalable oversight protocols, specifically debate and consultancy, in enabling humans to accurately supervise superhuman AI. The study uses large language models (LLMs) as both AI agents and stand-ins for human judges, with the judges being weaker than the agent models. The tasks evaluated include extractive QA, closed QA, and multimodal reasoning, covering a wide range of asymmetries between judges and agents. The results show that debate consistently outperforms consultancy across all tasks, particularly in extractive QA tasks with information asymmetry. However, the performance of debate compared to direct question answering depends on the type of task. The study also finds that stronger debater models increase judge accuracy, though the effect is modest. Additionally, the paper introduces open consultancy and open debate protocols, where the consultant or debater can choose which answer to argue for, and finds that weak judges perform better in these protocols when the consultant or debater chooses correctly. The findings suggest that debate is a promising scalable oversight protocol, though further research is needed to evaluate its effectiveness as a training method.
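To make the two protocols concrete, here is a minimal sketch of how consultancy and debate judging can be wired up, assuming a generic `query(model, prompt) -> str` wrapper around whatever LLM API is in use. The role names, prompt wording, and number of turns are illustrative assumptions, not the paper's exact setup.

```python
from typing import Callable

# (model_name, prompt) -> completion; supply your own LLM call here.
Query = Callable[[str, str], str]


def consultancy(query: Query, question: str, assigned_answer: str,
                consultant: str = "strong-model", judge: str = "weak-model") -> str:
    """One strong consultant argues for a single assigned answer (which may be
    wrong); the weaker judge reads the argument and gives a verdict."""
    argument = query(consultant,
                     f"Question: {question}\n"
                     f"Argue that the answer is: {assigned_answer}")
    return query(judge,
                 f"Question: {question}\n"
                 f"Consultant's argument: {argument}\n"
                 f"Which answer do you think is correct, and why?")


def debate(query: Query, question: str, answer_a: str, answer_b: str,
           debater: str = "strong-model", judge: str = "weak-model",
           turns: int = 2) -> str:
    """Two copies of the strong model argue for opposing answers over several
    turns; the weaker judge reads the full transcript and decides."""
    transcript = ""
    for _ in range(turns):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            argument = query(debater,
                             f"Question: {question}\n"
                             f"Transcript so far:\n{transcript}\n"
                             f"Argue that the correct answer is ({side}) {answer}.")
            transcript += f"Debater {side}: {argument}\n"
    return query(judge,
                 f"Question: {question}\n"
                 f"Debate transcript:\n{transcript}\n"
                 f"Which answer, A or B, is correct?")
```

In the "open" variants described above, the consultant or first debater would additionally be asked to pick which answer to argue for rather than having it assigned, so judge accuracy then also depends on how often that choice is correct.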