2024 | Akbir Khan\*¹, John Hughes\*²³, Dan Valentine\*³, Laura Ruis¹, Kshitij Sachan⁴⁵, Ansh Radhakrishnan⁴, Edward Grefenstette¹, Samuel R. Bowman⁴, Tim Rocktäschel¹, Ethan Perez⁴⁶
The paper explores the effectiveness of debate as a method for aligning large language models (LLMs) with desired behavior, particularly in the absence of ground truth data. The authors investigate whether weaker models can assess the correctness of stronger models' answers by judging debates between them. They use a reading comprehension task (QuALITY) to evaluate how well non-expert judges (both human and LLM) answer questions when given arguments from two expert models (debaters) that have read the underlying passage. The key findings include:
1. **Enhanced Accuracy**: Debate consistently improves the accuracy of both non-expert human judges (88%) and non-expert LLM judges (76%) over naive baselines (60% for humans and 48% for LLMs). The consultancy baseline, in which a single expert model argues for its assigned answer, reaches 78% and 54% accuracy, respectively (a minimal sketch of both protocols appears after this list).
2. **Optimizing Persuasiveness**: Optimizing debaters for persuasiveness via inference-time methods (best-of-$N$ sampling and critique-and-refinement) improves the ability of non-expert judges to identify the truth. Debaters optimized for judge approval (persuasiveness) argue more effectively for the correct answer than for the incorrect one; the sketch after this list includes a simple best-of-$N$ selection step.
3. **Human Judges' Calibration**: Human judges are well calibrated and achieve lower error rates with debate than with consultancy. They are underconfident in their answers, which helps reduce false positives (a toy calibration check appears at the end of this note).
4. **Model Generalization**: The findings generalize to different base LLMs and human judges, indicating that the debate protocol is robust to variations in judge skill.
5. **Limitations**: The study assumes that stronger models differ from weaker ones only in information access (debaters can read the hidden passage, judges cannot). Future work should explore how reasoning ability and other skills affect debate performance.
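To make the protocols in points 1 and 2 concrete, below is a minimal Python sketch of a single-turn debate, the consultancy baseline, and best-of-$N$ selection of the most persuasive argument. Everything here is an illustrative assumption rather than the paper's implementation: the `generate` stub stands in for an LLM API call, the prompts and names (`Question`, `debater_argument`) are invented, and persuasiveness is scored with a random placeholder instead of a judge model; the paper's actual protocols are multi-turn and let debaters quote from the hidden story.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-in for an LLM API call; a real implementation would
# query a debater or judge model here.
def generate(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"

@dataclass
class Question:
    story: str            # hidden passage that only debaters can read
    question: str
    answers: tuple        # (answer_a, answer_b): one correct, one incorrect

def debater_argument(q: Question, position: str, n_candidates: int = 4) -> str:
    """Best-of-N: sample several candidate arguments and keep the one rated
    most persuasive (judge approval as the selection signal)."""
    candidates = [
        generate(
            f"Story: {q.story}\nQuestion: {q.question}\n"
            f"Argue convincingly that the answer is: {position}"
        )
        for _ in range(n_candidates)
    ]
    # Persuasiveness would be scored by a judge model; random stub here.
    def persuasiveness(argument: str) -> float:
        return random.random()
    return max(candidates, key=persuasiveness)

def debate_round(q: Question) -> str:
    """Debate protocol: two expert debaters argue opposing answers; a
    non-expert judge, who never sees the story, picks the winner."""
    arg_a = debater_argument(q, q.answers[0])
    arg_b = debater_argument(q, q.answers[1])
    judge_prompt = (
        f"Question: {q.question}\n\n"
        f"Debater A argues for '{q.answers[0]}':\n{arg_a}\n\n"
        f"Debater B argues for '{q.answers[1]}':\n{arg_b}\n\n"
        "Which answer is correct? Reply 'A' or 'B' with a confidence."
    )
    return generate(judge_prompt)

def consultancy_round(q: Question, assigned_answer: str) -> str:
    """Consultancy baseline: a single expert argues for its assigned answer
    (correct only half the time); the judge sees just one side."""
    argument = debater_argument(q, assigned_answer)
    judge_prompt = (
        f"Question: {q.question}\n\n"
        f"A consultant argues for '{assigned_answer}':\n{argument}\n\n"
        "Is the consultant's answer correct? Reply 'yes' or 'no' with a confidence."
    )
    return generate(judge_prompt)
```

The key structural contrast is that the judge prompt in `debate_round` contains two opposing arguments while `consultancy_round` exposes only one, and in neither case does the judge see `q.story`.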
Overall, the paper provides encouraging evidence that debate can be a viable method for scalable oversight in the absence of ground truth data, paving the way for further research in fine-tuning LLMs via debate.
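As a concrete illustration of the calibration point in item 3, the toy sketch below (with made-up judgments, not the paper's data) bins judge confidences and compares each bin's average stated confidence with its empirical accuracy; a well-calibrated judge has the two roughly matching, and underconfidence shows up as bins where accuracy exceeds mean confidence.

```python
from collections import defaultdict

def calibration_table(confidences, correct, n_bins=5):
    """Bucket judge confidences and compare stated confidence against
    empirical accuracy within each bucket."""
    bins = defaultdict(list)
    for conf, is_correct in zip(confidences, correct):
        bucket = min(int(conf * n_bins), n_bins - 1)
        bins[bucket].append((conf, is_correct))
    rows = []
    for bucket in sorted(bins):
        items = bins[bucket]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        rows.append((mean_conf, accuracy, len(items)))
    return rows

# Toy usage: judge confidence in [0, 1] and whether each verdict was correct.
confs = [0.55, 0.90, 0.70, 0.95, 0.60, 0.85]
hits  = [1,    1,    0,    1,    1,    1]
for mean_conf, acc, n in calibration_table(confs, hits):
    print(f"conf~{mean_conf:.2f}  acc={acc:.2f}  n={n}")
```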