ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence


2024-06-10 | Kevin Wu, Eric Wu, James Zou
Large language models (LLMs) are prone to hallucinations and incorrect answers. Retrieval-augmented generation (RAG) is a common framework that provides relevant retrieved content in the LLM prompt and can significantly improve model accuracy. However, RAG can also lead LLMs to adopt incorrect retrieved content, overriding their own correct prior knowledge. To study this, we curate a dataset of over 1,200 questions across six domains, along with content relevant to answering each question, and apply precise perturbations to the answers in that content, ranging from subtle to blatant errors. We benchmark six top-performing LLMs, including GPT-4o, on this dataset and find that they are susceptible to adopting incorrect retrieved content, overriding their own correct prior knowledge over 60% of the time. However, the more unrealistic the retrieved content is, the less likely the model is to adopt it; and the less confident a model is in its initial response, the more likely it is to adopt the information in the retrieved content. Our results highlight a difficult task and benchmark for LLMs: correctly discerning when they are wrong in light of correct retrieved content, and rejecting retrieved content when it is incorrect. Our dataset, called ClashEval, and our evaluations are open-sourced to allow future benchmarking of top-performing models at https://github.com/kevinwu23/StanfordClashEval.

Our contributions are as follows. We introduce ClashEval, a question-answering benchmark of over 1,200 questions spanning six domains, each paired with a relevant contextual document; the answer in each document is perturbed across a range of erroneous values, from subtle to extreme. We benchmark six top-performing LLMs on this dataset and report three relevant metrics. We provide a systematic analysis of context preference rates across three models under (1) varying degrees of perturbation of the contextual information and (2) different token probabilities of the prior responses. Finally, we propose a simple way to improve performance on ClashEval by incorporating token probabilities.
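To make the perturbation idea concrete, here is a minimal sketch of how a numeric answer inside a retrieved passage could be rewritten into increasingly wrong values. The `Example` structure, the drug-dosage text, and the specific multipliers are illustrative assumptions for this sketch, not the paper's exact perturbation scheme.

```python
# Minimal sketch of the subtle-to-blatant perturbation idea described above.
# The Example structure and the multipliers are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    passage: str        # retrieved content that states the answer
    gold_answer: float  # correct numeric answer appearing in the passage


# Multipliers chosen only to illustrate "subtle" through "blatant" errors.
PERTURBATION_FACTORS = [1.1, 1.5, 3.0, 10.0]


def perturb(example: Example) -> list[Example]:
    """Return copies of the example whose passages state increasingly wrong answers."""
    perturbed = []
    for factor in PERTURBATION_FACTORS:
        wrong = round(example.gold_answer * factor, 2)
        new_passage = example.passage.replace(str(example.gold_answer), str(wrong))
        perturbed.append(Example(example.question, new_passage, wrong))
    return perturbed


if __name__ == "__main__":
    ex = Example(
        question="What is the maximum recommended daily dose of drug X in mg?",
        passage="The maximum recommended daily dose of drug X is 40.0 mg.",
        gold_answer=40.0,
    )
    for p in perturb(ex):
        print(p.passage)
```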
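The proposed improvement, incorporating token probabilities, can be read as a simple decision rule: answer the question once without the retrieved content and once with it, and when the two answers conflict, keep the one the model was more confident about (here, higher mean token log-probability). The `generate_with_logprobs` callable and the `margin` parameter below are hypothetical stand-ins rather than the paper's calibrated method; this is a sketch under those assumptions.

```python
from typing import Callable

# Hypothetical generation function: returns the answer string and the
# log-probabilities of its tokens. Any API that exposes per-token logprobs
# could back this; the name and signature are assumptions for illustration.
GenerateFn = Callable[[str], tuple[str, list[float]]]


def mean_logprob(token_logprobs: list[float]) -> float:
    """Average per-token log-probability as a crude confidence score."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)


def answer_with_conflict_check(
    question: str,
    retrieved_passage: str,
    generate_with_logprobs: GenerateFn,
    margin: float = 0.0,  # illustrative: extra confidence required to keep the prior
) -> str:
    # 1) Prior answer: no retrieved content in the prompt.
    prior_answer, prior_lp = generate_with_logprobs(question)

    # 2) Contextual answer: retrieved content prepended to the prompt.
    rag_prompt = f"Context:\n{retrieved_passage}\n\nQuestion: {question}"
    ctx_answer, ctx_lp = generate_with_logprobs(rag_prompt)

    if prior_answer == ctx_answer:
        return ctx_answer  # no conflict to resolve

    # 3) Conflict: keep whichever answer the model was more confident about.
    if mean_logprob(prior_lp) > mean_logprob(ctx_lp) + margin:
        return prior_answer
    return ctx_answer
```

Comparing mean log-probabilities rather than summed probabilities keeps the confidence score length-normalized, which matters when the prior and contextual answers differ in token count.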