On scalable oversight with weak LLMs judging strong LLMs

2024-7-15 | Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
This paper evaluates scalable oversight protocols for superhuman AI, focusing on debate, consultancy, and direct question-answering. Large language models (LLMs) play both roles: stronger models act as the agents (debaters or consultants), while weaker models act as judges, standing in for human overseers. The tasks span extractive QA, closed QA, and multimodal reasoning, covering a range of capability gaps between judge and agent.

The results show that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct or incorrect answer. Debate also beats direct question-answering on extractive QA tasks with information asymmetry, but not on the other tasks. In the open setting, where the agent chooses which answer to defend, open debate yields higher judge accuracy than open consultancy when the agent argues for an incorrect answer. Stronger debaters lead to higher judge accuracy, though the effect is modest.

These findings highlight debate as a promising scalable oversight protocol, but the experiments evaluate it only at inference time, not as a training protocol. The authors note that current fine-tuning approaches may favor direct QA over debate, and they suggest that future work explore training debaters via self-play, using the judge's verdict as a reward signal. Whether debate remains effective in a training setting is left to future evaluation.
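To make the three protocols concrete, the sketch below illustrates the judge/agent roles at a high level. It is a minimal illustration, not the paper's implementation: the `query_llm(model, prompt)` helper, the prompt wording, and the turn structure are illustrative assumptions, and details such as the open variants (where the agent picks its own answer) and the judge's interaction budget are simplified.

```python
import random

def query_llm(model: str, prompt: str) -> str:
    """Hypothetical placeholder for a model call; wire this to a real LLM API."""
    raise NotImplementedError

def debate(question: str, answers: tuple[str, str], debater: str, judge: str,
           rounds: int = 3) -> str:
    """Assigned debate: two copies of the stronger model each defend one candidate
    answer; the weaker judge reads the transcript and picks an answer."""
    transcript: list[str] = []
    for _ in range(rounds):
        for side, answer in enumerate(answers):
            argument = query_llm(
                debater,
                f"Question: {question}\nDefend the answer: {answer}\n"
                "Transcript so far:\n" + "\n".join(transcript),
            )
            transcript.append(f"Debater {side} (for '{answer}'): {argument}")
    return query_llm(
        judge,
        f"Question: {question}\nDebate transcript:\n" + "\n".join(transcript) +
        f"\nWhich answer is correct, '{answers[0]}' or '{answers[1]}'?",
    )

def consultancy(question: str, answers: tuple[str, str], consultant: str,
                judge: str, rounds: int = 3) -> str:
    """Assigned consultancy: one stronger model is randomly given the correct or
    incorrect answer and argues only for it; the judge can ask follow-ups."""
    assigned = random.choice(answers)
    transcript: list[str] = []
    for _ in range(rounds):
        argument = query_llm(
            consultant,
            f"Question: {question}\nArgue that the answer is: {assigned}\n"
            "Transcript so far:\n" + "\n".join(transcript),
        )
        transcript.append(f"Consultant (for '{assigned}'): {argument}")
        follow_up = query_llm(
            judge,
            f"Question: {question}\nTranscript:\n" + "\n".join(transcript) +
            "\nAsk one clarifying question.",
        )
        transcript.append(f"Judge: {follow_up}")
    return query_llm(
        judge,
        f"Question: {question}\nTranscript:\n" + "\n".join(transcript) +
        f"\nWhich answer is correct, '{answers[0]}' or '{answers[1]}'?",
    )

def direct_qa(question: str, answers: tuple[str, str], judge: str) -> str:
    """Baseline: the weaker judge answers on its own, with no stronger model."""
    return query_llm(
        judge,
        f"Question: {question}\n"
        f"Which answer is correct, '{answers[0]}' or '{answers[1]}'?",
    )
```

In all three cases, judge accuracy against the ground-truth answer is the metric being compared; the paper's key comparisons are debate versus consultancy (assigned and open) and debate versus this direct-QA baseline.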