10 Jul 2024 | Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, and Matthias Bethge
CiteME: Can Language Models Accurately Cite Scientific Claims?
This paper introduces CiteME, a new benchmark for evaluating the ability of language models (LMs) to correctly attribute scientific claims to their sources. The benchmark consists of text excerpts from recent machine learning papers, each referencing a single other paper. The goal is to assess whether LMs can act as research assistants to identify the referenced paper. The benchmark reveals a significant gap between LMs and human performance, with LMs achieving only 4.2-18.5% accuracy and humans 69.7%. To address this gap, the authors introduce CiteAgent, an autonomous system built on the GPT-4o LM that can search and read papers, achieving 35.3% accuracy on CiteME. CiteME serves as a challenging testbed for open-ended claim attribution, driving the research community towards a future where any claim made by an LM can be automatically verified and discarded if found to be incorrect.
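CiteAgent's search-and-read behavior can be pictured as a simple action loop: the LM repeatedly chooses to search for papers, read a candidate, or select a final citation. The following is a hypothetical sketch only; the action names and the stub search and LM functions are illustrative stand-ins, not the authors' actual prompts or implementation.

```python
# Hypothetical CiteAgent-style loop. fake_search and fake_lm are toy
# stand-ins (a real system would call a paper-search API and an LM).

def fake_search(query):
    """Stand-in for a paper search engine: keyword -> candidate titles."""
    corpus = {
        "residual connections": ["Deep Residual Learning", "Highway Networks"],
        "attention": ["Attention Is All You Need"],
    }
    return [title for key, titles in corpus.items()
            if key in query.lower() for title in titles]

def fake_lm(excerpt, candidates):
    """Stand-in for the LM policy: pick a candidate matching the claim,
    otherwise issue another search."""
    for title in candidates:
        if any(word.lower() in excerpt.lower() for word in title.split()):
            return ("SELECT", title)
    return ("SEARCH", excerpt)  # no match yet: search with the raw excerpt

def cite_agent(excerpt, max_steps=5):
    candidates = []
    for _ in range(max_steps):
        action, arg = fake_lm(excerpt, candidates)
        if action == "SELECT":
            return arg
        candidates = fake_search(arg)
    return None  # gave up without answering: one failure mode the paper analyzes

print(cite_agent("We use residual connections [CITATION]."))
# → Deep Residual Learning
```

The loop terminates either by committing to a citation or by exhausting its step budget, which mirrors the benchmark's distinction between wrong answers and premature stops.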
The paper also presents an analysis of the performance of various LMs and retrieval methods on CiteME. Among the systems evaluated, CiteAgent built on GPT-4o performs best, achieving 35.3% accuracy, while specialized retrieval models such as SPECTER2 and SPECTER achieve 0% accuracy. Humans reach 69.7% accuracy on the same task with less than a minute of search time per excerpt. The paper also discusses the limitations of current LMs in accurately attributing scientific claims, and CiteAgent's ability to narrow the gap by searching for and reading papers.
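Since each CiteME excerpt references exactly one target paper, the accuracy figures above reduce to exact-match scoring over predictions. A minimal sketch, with illustrative paper names:

```python
# Minimal CiteME-style scoring: accuracy is the fraction of excerpts
# whose predicted paper exactly matches the single correct target.

def citeme_accuracy(predictions, targets):
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

score = citeme_accuracy(
    ["Paper A", "Paper B", "Paper C"],   # system predictions
    ["Paper A", "Paper X", "Paper C"],   # gold targets
)
print(round(score, 3))
# → 0.667
```

In practice matching is done against a canonical paper identifier rather than a raw title string, so that formatting differences do not count as errors.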
The authors also analyze the errors made by CiteAgent, identifying three main types: misunderstanding the excerpt, stopping the search prematurely, and finding the correct citation during search but failing to return it. The paper concludes that current LMs cannot reliably link scientific claims to their sources and that further research is needed to improve their performance. The authors also emphasize the real-world applicability of their work: the agent is based on state-of-the-art LMs, requires no additional training, and can use a search engine, making it easily applicable to real-world settings.