What Evidence Do Language Models Find Convincing?

2024 | Alexander Wan, Eric Wallace, Dan Klein
This paper investigates how large language models (LLMs) judge the convincingness of evidence when faced with conflicting information. The authors introduce the CONFLICTINGQA dataset, which pairs controversial questions with conflicting evidence documents that contain different facts, argument styles, and answers. They use this dataset to analyze how LLMs evaluate the convincingness of text, focusing on factors such as relevance and stylistic features.

The study finds that LLMs rely primarily on the relevance of a document to the query, while largely ignoring stylistic features that humans find important, such as the presence of scientific references or a neutral tone. This suggests that current retrieval-augmented generation (RAG) models may not align well with human judgments of convincingness. The results highlight the importance of RAG corpus quality and suggest that future models should be trained to better align with human preferences.

The authors also perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. They find that simple changes to a document's relevance, such as adding a prefix indicating that the document is about the question, can significantly improve its win-rate. In contrast, stylistic changes such as adding informational content, references, or a more confident tone tend to have a neutral or negative effect on win-rate.

The study emphasizes the need for high-quality evidence in RAG systems and suggests that future research should explore how integrating other forms of information, such as metadata or visual content, can influence these behaviors. The authors also note the potential for LLM-generated content to influence how models judge convincingness, and call for further research into the ethical and societal implications of that influence.
The paper concludes that current LLMs tend to over-rely on relevance and ignore many stylistic features that humans often find important.
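To make the relevance-prefix counterfactual concrete, here is a minimal Python sketch. It is not the authors' code. It assumes win-rate means the fraction of pairwise comparisons in which the judge sides with a given document, and it replaces the LLM with a hypothetical `toy_judge` that simply prefers the document sharing more words with the question. The function names, the prefix wording, and the example question and documents are all invented for illustration.

```python
def add_relevance_prefix(document: str, question: str) -> str:
    """Counterfactual edit: prepend a sentence stating the document addresses the question.

    The prefix wording is an assumption, not the wording used in the paper.
    """
    return f"The following text is about the question: {question}\n{document}"


def toy_judge(question: str, doc_a: str, doc_b: str) -> int:
    """Hypothetical stand-in for an LLM call.

    Returns 0 if doc_a wins and 1 if doc_b wins. A document wins only if it
    shares strictly more whitespace-separated words with the question.
    """
    q_words = set(question.lower().split())

    def overlap(doc: str) -> int:
        return len(q_words & set(doc.lower().split()))

    return 0 if overlap(doc_a) > overlap(doc_b) else 1


def win_rate(target: str, opponents: list[str], question: str, judge=toy_judge) -> float:
    """Fraction of pairwise comparisons in which the judge picks `target`."""
    wins = sum(judge(question, target, opp) == 0 for opp in opponents)
    return wins / len(opponents)


# Made-up example data.
question = "does coffee improve memory"
target = "studies report better recall after caffeine"
opponents = [
    "coffee does not improve anything",
    "memory is unaffected by coffee",
]

before = win_rate(target, opponents, question)
after = win_rate(add_relevance_prefix(target, question), opponents, question)
print(before, after)
```

Under this purely relevance-driven toy judge, the prefixed document wins every comparison even though its actual evidence is unchanged, which mirrors the kind of sensitivity the summary describes. A real replication would swap `toy_judge` for calls to an actual model and average over many question and document pairs.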