2024 | Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez
This paper investigates whether weaker models can assess the correctness of stronger models through debate, a method in which two experts argue for opposing answers and a non-expert selects the correct one. The study is conducted on the QuALITY dataset, which contains reading comprehension questions about science fiction stories from Project Gutenberg. The experiments involve both human and large language model (LLM) judges, while the LLM debaters are optimized for persuasiveness using techniques such as best-of-N sampling and critique-and-refinement.
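As a rough illustration of the protocol described above, the sketch below shows how two debaters, each using best-of-N sampling against a persuasiveness score, could produce a transcript for a blind judge. The callable interfaces, the round count, and the value of N are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's code):
#   generate(question, answer, transcript) -> argument string
#   score(argument, transcript)            -> persuasiveness rating
#   judge(question, answer_a, answer_b, transcript) -> chosen answer
Generate = Callable[[str, str, list], str]
Score = Callable[[str, list], float]
Judge = Callable[[str, str, str, list], str]

def best_of_n(generate: Generate, score: Score, question: str, answer: str,
              transcript: list, n: int = 8) -> str:
    """Best-of-N sampling: draw n candidate arguments and keep the one the
    preference model rates as most persuasive."""
    candidates = [generate(question, answer, transcript) for _ in range(n)]
    return max(candidates, key=lambda arg: score(arg, transcript))

def run_debate(gen_a: Generate, gen_b: Generate, score: Score, judge: Judge,
               question: str, answer_a: str, answer_b: str,
               rounds: int = 3) -> str:
    """Two expert debaters argue for opposing answers over several rounds;
    the non-expert judge sees only the transcript, never the source passage."""
    transcript: List[Tuple[str, str]] = []
    for _ in range(rounds):
        arg_a = best_of_n(gen_a, score, question, answer_a, transcript)
        arg_b = best_of_n(gen_b, score, question, answer_b, transcript)
        transcript.append((arg_a, arg_b))
    return judge(question, answer_a, answer_b, transcript)
```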
The results show that debate significantly improves the accuracy of both human and LLM judges compared with consultancy, a baseline in which a single expert argues for one assigned answer. With debate, human judges reach 88% accuracy and LLM judges 76%, versus naive baselines of 60% and 48%, respectively. Additionally, optimizing debaters for persuasiveness enhances non-expert judges' ability to identify the true answer.
The study also finds that more persuasive debaters are better at arguing for the correct answer, and that human judges are well calibrated and achieve lower error rates with debate than with consultancy. Furthermore, stronger judges are better at identifying correct arguments across all debater strengths, and higher aggregate Elo ratings (a measure of persuasiveness) correlate with higher judge accuracy.
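For concreteness, Elo-style ratings of this kind can be derived from pairwise debate outcomes. The sketch below uses the standard incremental Elo update as an approximation; the match format, K-factor, and initial rating are assumptions rather than the paper's exact fitting procedure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Standard Elo update: compute A's expected score from the rating gap,
    then move both ratings a K-factor step toward the observed result."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

def elo_from_debates(matches: List[Tuple[str, str, bool]],
                     initial: float = 1000.0) -> Dict[str, float]:
    """Derive per-debater ratings from judge verdicts on head-to-head debates.
    `matches` holds (debater_a, debater_b, a_won) tuples, where a_won records
    whether the judge sided with debater A."""
    ratings: Dict[str, float] = defaultdict(lambda: initial)
    for a, b, a_won in matches:
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_won)
    return dict(ratings)
```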
The paper concludes that debate is a promising method for scalable oversight of LLMs, as it allows non-experts to identify the correct answers even when the underlying information is not accessible. The findings suggest that optimizing models for persuasiveness can lead to more truthful outcomes, and that debate can be used to augment human judgments and generate accurate labels for questions beyond their knowledge. The study provides empirical evidence that debate can be an effective method for aligning models with desired behavior in the absence of ground truth.