AttributionBench: How Hard is Automatic Attribution Evaluation?


23 Feb 2024 | Yifei Li, Xiang Yue, Zeyi Liao, Huan Sun
Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating an answer's attribution, i.e., whether every claim within the generated response is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for automatic attribution evaluation methods. To bridge the gap left by the absence of standardized benchmarks for these methods, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation. A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information and from the discrepancy between the information available to the model and that available to human annotators.

We propose AttributionBench as a benchmark with a unified formulation for attribution evaluation, enabling the community to compare different methods fairly and track progress on this important task. We conduct comprehensive experiments and show that existing cutting-edge LLMs such as GPT-4 and fine-tuned GPT-3.5 still cannot perform well on this task, and through a series of in-depth error analyses we offer insights into why automatic attribution evaluation is difficult and where future work could help.

AttributionBench is designed for both training and evaluating cutting-edge automatic attribution evaluators. We meticulously sample data from 7 different datasets that cover different question domains and diverse responses and evidence, unify them into a binary classification format with a label-balanced setting for fair comparison, and compile them into a training set and two test sets for in-distribution (ID) and out-of-distribution (OOD) evaluation.
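As a rough illustration of this unified binary formulation, an attribution example pairs a claim with its cited evidence and asks whether the claim is fully supported. The sketch below is a minimal assumption-laden rendering: the field names, prompt wording, and example content are hypothetical and do not reproduce AttributionBench's exact schema or prompts.

```python
# Minimal sketch of a binary attribution-evaluation example.
# Field names and prompt wording are illustrative assumptions,
# not AttributionBench's exact schema.
from dataclasses import dataclass


@dataclass
class AttributionExample:
    question: str   # user query posed to the generative search engine
    claim: str      # a claim extracted from the generated response
    evidence: str   # the evidence passage cited for that claim
    label: int      # 1 = attributable (fully supported), 0 = not attributable


def build_prompt(ex: AttributionExample) -> str:
    """Render one example as a binary judgment for an LLM-based evaluator."""
    return (
        "Decide whether the claim is fully supported by the evidence.\n"
        f"Question: {ex.question}\n"
        f"Claim: {ex.claim}\n"
        f"Evidence: {ex.evidence}\n"
        "Answer with exactly one word: 'attributable' or 'not-attributable'."
    )


example = AttributionExample(
    question="When was the Eiffel Tower completed?",
    claim="The Eiffel Tower was completed in 1889.",
    evidence="Construction of the tower finished in March 1889, in time for "
             "the Exposition Universelle.",
    label=1,
)
print(build_prompt(example))
```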
We conduct extensive experiments and analysis on the proposed benchmark. Surprisingly, even fine-tuned GPT-3.5 only reaches around 80% macro-F1 under both the ID and OOD settings, which is still far from practical use. To better understand the challenges of this task, we manually labeled over 300 error cases from GPT-3.5 under chain-of-thought (CoT) prompting, which generates rationales for the model's predictions and thereby reveals the reasons behind each error. We find that over 66% of the errors are caused by the model's insensitivity to fine-grained information, and about 26.8% are caused by the mismatch between the information accessible to the model and that accessible to human annotators.

Our contributions include: (1) a comprehensive benchmark for automatic attribution evaluation; (2) comprehensive experiments showing that existing cutting-edge LLMs still cannot perform well on this task; and (3) in-depth error analyses that explain why automatic attribution evaluation is difficult and suggest directions for future work.
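For reference, the macro-F1 numbers reported above average the per-class F1 of the two labels, so a model cannot score well simply by favoring whichever label is more frequent. A generic computation (a sketch using scikit-learn, not the benchmark's official evaluation script; the toy labels are made up) looks like this:

```python
# Generic macro-F1 computation over binary attribution labels.
# Illustrative only; not AttributionBench's official evaluation code.
from sklearn.metrics import f1_score

# 1 = attributable, 0 = not attributable (toy data for illustration)
gold        = [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 0, 1, 1, 0]

# average="macro" weights the F1 of each class equally.
macro_f1 = f1_score(gold, predictions, average="macro")
print(f"macro-F1 = {macro_f1:.3f}")
```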