Don’t Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration

2024-07-01 | Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, Yulia Tsvetkov
This paper addresses knowledge gaps in large language models (LLMs) and proposes methods for identifying them by enabling LLMs to abstain from answering questions when they lack sufficient knowledge. The authors first survey existing approaches based on calibration, adaptation, and self-reflection, finding that these methods often rely on held-out sets and suffer from limitations such as hallucinations and biases. To overcome these challenges, they introduce two collaboration-based approaches: COOPERATE and COMPETE. COOPERATE has multiple LLMs provide feedback on each other's answers, while COMPETE challenges an LLM with conflicting knowledge generated by other models. Experiments on four QA tasks across diverse knowledge domains show that both approaches achieve up to 19.3% improvement in abstain accuracy over the strongest baseline. The proposed methods also help identify failure cases in retrieval augmentation and pinpoint knowledge gaps in multi-hop reasoning. The paper contributes a critical evaluation of existing abstention methods and robust, multi-LLM collaboration-based approaches that enhance LLM reliability and reduce hallucinations.
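
To make the two mechanisms concrete, below is a minimal Python sketch of how collaboration-based abstention could be wired up. It is an illustration based only on the summary above, not the paper's actual prompts or decision rules; the function names (cooperate_abstain, compete_abstain), the YES/NO voting scheme, and the majority threshold are assumptions made for the example.

```python
from typing import Callable, List

# An "LLM" here is any function that maps a prompt string to a response string,
# e.g. a thin wrapper around an API client or a local model.
LLM = Callable[[str], str]


def cooperate_abstain(question: str, answer: str, reviewers: List[LLM]) -> bool:
    """COOPERATE-style check (illustrative): ask several reviewer LLMs whether a
    proposed answer is correct and abstain if a majority does not endorse it."""
    votes = []
    for reviewer in reviewers:
        prompt = (
            f"Question: {question}\n"
            f"Proposed answer: {answer}\n"
            "Is the proposed answer factually correct? Reply YES or NO."
        )
        votes.append(reviewer(prompt).strip().upper().startswith("YES"))
    # Abstain when fewer than half of the reviewers endorse the answer.
    return sum(votes) < len(votes) / 2


def compete_abstain(question: str, answer: str, answerer: LLM, rivals: List[LLM]) -> bool:
    """COMPETE-style check (illustrative): confront the answering model with
    conflicting claims from other LLMs and abstain if it abandons its answer."""
    for rival in rivals:
        conflicting = rival(f"Give a plausible alternative answer to: {question}")
        probe = (
            f"Question: {question}\n"
            f"Another model claims the answer is: {conflicting}\n"
            f"Your earlier answer was: {answer}\n"
            "Which answer do you stand by? Reply with the final answer only."
        )
        if answer.lower() not in answerer(probe).lower():
            # The model was swayed by conflicting knowledge, suggesting a gap.
            return True
    return False
```

In both sketches the caller would only surface the model's answer when the abstain check returns False; the actual paper evaluates this kind of abstain decision via abstain accuracy on QA benchmarks.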