July 14-18, 2024 | James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibble
This paper discusses the evaluation of machine-generated reports, emphasizing the need for completeness, accuracy, and verifiability in long-form reports. Large Language Models (LLMs) have enabled new ways to satisfy information needs, but they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy complex, nuanced, or multi-faceted information needs. The authors propose an evaluation framework called ARGUE (Automated Report Generation Under Evaluation) that draws on ideas from prior evaluations in information retrieval, summarization, and text generation. The framework uses "nuggets" of information, expressed as questions and answers, that need to be part of any high-quality generated report. Citations are also a key component of the framework, ensuring that claims made in the report are mapped to their source documents for verifiability.
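To make the nugget-and-citation idea concrete, here is a minimal Python sketch of how such data might be represented. The class and field names are our own illustration rather than anything defined by ARGUE; the paper specifies only that nuggets are question-answer units tied to attesting documents and that report claims carry citations back into the collection.

```python
from dataclasses import dataclass, field


@dataclass
class Nugget:
    """A unit of information a good report should contain, expressed as a
    question plus the answers an assessor would accept (hypothetical schema)."""
    question: str          # e.g., "When did the event occur?"
    answers: list[str]     # acceptable answer strings
    attested_in: set[str]  # IDs of collection documents that attest an answer


@dataclass
class ReportSentence:
    """One sentence of a generated report together with its citations."""
    text: str
    citations: set[str] = field(default_factory=set)  # cited document IDs


@dataclass
class Report:
    """A generated report: the request it answers and its cited sentences."""
    request_id: str
    sentences: list[ReportSentence]
```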
The framework is designed to evaluate automated report generation by ensuring that reports are responsive to the report request, that the key information is attested in the document collection, that the report properly cites those documents, and that the information those documents contain is faithfully captured by the report. The evaluation must also avoid circularity between report generation and report evaluation, and the authors argue that it should be designed with reuse in mind, which is challenging given the level of interpretation the judgments require.
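As a rough illustration of how these criteria could be operationalized, the sketch below (building on the data structures above) computes a simple nugget recall and citation precision. This is a simplified stand-in, not the scoring procedure defined in the paper; the callables `answers_nugget` and `supports` are hypothetical placeholders for assessor judgments.

```python
def nugget_recall(report: Report, nuggets: list[Nugget], answers_nugget) -> float:
    """Fraction of nuggets that the report both answers and supports with a
    citation to a document that attests that nugget.

    `answers_nugget(sentence, nugget)` is a hypothetical stand-in for the
    assessor's judgment that a sentence answers the nugget's question.
    """
    if not nuggets:
        return 0.0
    covered = 0
    for nugget in nuggets:
        for sentence in report.sentences:
            if answers_nugget(sentence, nugget) and (sentence.citations & nugget.attested_in):
                covered += 1
                break  # count each nugget at most once
    return covered / len(nuggets)


def citation_precision(report: Report, supports) -> float:
    """Fraction of citations whose cited document actually supports the citing
    sentence; `supports(doc_id, sentence)` again stands in for an assessor."""
    cites = [(doc_id, s) for s in report.sentences for doc_id in s.citations]
    if not cites:
        return 0.0
    return sum(1 for doc_id, s in cites if supports(doc_id, s)) / len(cites)
```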
The paper also discusses the requirements of a report evaluation, including the roles of report requester, report audience, report writer, and assessor. In addition to the core criteria above (responsiveness, attestation in the collection, proper citation, and faithful use of the cited documents), the evaluation must also consider the fluency, coherence, consistency, and rhetorical structure of the report.
The paper reviews related work on report writing and evaluation, including summarization, retrieval-augmented generation, and question answering, and it discusses the challenges of evaluating machine-generated reports, including the need to account for hallucination and the importance of citations. The authors propose that the evaluation framework be flexible and adaptable to different types of report generation tasks; it is designed to be used by the TREC NeuCLIR track in its report generation task. They conclude that the framework should be able to handle a wide range of report generation tasks and to assess the effectiveness of different report generation systems.