AttributionBench: How Hard is Automatic Attribution Evaluation?


23 Feb 2024 | Yifei Li, Xiang Yue, Zeyi Liao, Huan Sun
**Authors:** Yifei Li, Xiang Yue, Zeyi Liao, Huan Sun
**Institution:** The Ohio State University

**Abstract:** Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating the attribution of these responses, i.e., whether every claim is fully supported by its cited evidence, is still an open problem. Traditional methods rely on costly human evaluation, highlighting the need for automatic attribution evaluation methods. To address this gap, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. In particular, even a fine-tuned GPT-3.5 achieves only around 80% macro-F1 under a binary classification formulation. A detailed analysis of over 300 error cases indicates that the majority of failures stem from the model's inability to process nuanced information and from the discrepancy between the information accessible to the model and to human annotators.

**Introduction:** The advent of large language models (LLMs) has revolutionized information retrieval and text generation, leading to advanced generative search engines. However, evaluating the attribution of their responses remains a challenge: human evaluation is expensive, so automatic evaluation methods are needed. Previous work has proposed frameworks and models for automatic attribution evaluation, but they often use different formulations and datasets, making direct comparisons difficult.

**AttributionBench:** We propose AttributionBench, a systematic benchmark for training and evaluating automatic attribution evaluators. It unifies various datasets into a binary classification format with a label-balanced setting for fair comparison (see the sketches after this summary). The benchmark includes a training set and two test sets for in-distribution (ID) and out-of-distribution (OOD) evaluation.

**Experimental Setup:** We conduct extensive experiments with multiple models, including decoder-only, encoder-decoder, and encoder-only models. Our results show that fine-tuning on NLI-related data is beneficial for attribution evaluation. However, automatic attribution evaluation remains challenging under zero-shot settings. Fine-tuning on AttributionBench improves performance on both ID and OOD evaluation.

**Error Analysis:** A detailed qualitative error analysis using GPT-3.5 reveals that over 66% of errors are caused by the model's insensitivity to fine-grained information, and about 26.8% are due to the discrepancy between the information accessible to the model and to human annotators.

**Contributions:**
- **Benchmark:** AttributionBench provides a comprehensive benchmark for fair comparison and for tracking progress.
- **Methods:** Extensive experiments show that existing LLMs still struggle with attribution evaluation.
- **Analysis:** Insights into the challenges of automatic attribution evaluation and potential directions for future work.

**Conclusion:** Our research highlights the difficulties of automatic attribution evaluation and provides valuable insights for future work on automatic attribution evaluators.
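To make the unified task format concrete, below is a minimal sketch (not the authors' release code) of what a binary-classification attribution example might look like, assuming a simple `(claim, evidence, label)` record and downsampling-based label balancing; both the record layout and the balancing strategy are illustrative assumptions, since the summary above does not specify them.

```python
# Sketch: unifying attribution data into (claim, evidence) -> {1: attributable,
# 0: not attributable} records and balancing the two labels by downsampling.
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AttributionExample:
    claim: str      # a single claim from a generated response
    evidence: str   # the cited evidence passage(s)
    label: int      # 1 = fully supported by the evidence, 0 = not supported


def balance_labels(examples, seed=0):
    """Downsample the majority class so both labels appear equally often."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex.label].append(ex)
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    data = [
        AttributionExample("The Eiffel Tower is in Paris.",
                           "The Eiffel Tower is a landmark in Paris, France.", 1),
        AttributionExample("The Eiffel Tower was built in 1900.",
                           "Construction of the tower finished in 1889.", 0),
        AttributionExample("Water boils at 100 C at sea level.",
                           "At sea level, water boils at 100 degrees Celsius.", 1),
    ]
    print(len(balance_labels(data)))  # -> 2 (one example per label)
```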
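For the reported metric, here is a minimal sketch of scoring an evaluator's binary verdicts with macro-F1 using scikit-learn. The gold and predicted labels are toy values for illustration only, not results from the paper.

```python
# Sketch: macro-F1 over binary attribution verdicts
# (1 = attributable, 0 = not attributable).
from sklearn.metrics import f1_score

gold = [1, 0, 1, 1, 0, 0, 1, 0]   # human annotations (toy data)
pred = [1, 0, 0, 1, 0, 1, 1, 0]   # evaluator verdicts (toy data)

macro_f1 = f1_score(gold, pred, average="macro")
print(f"macro-F1: {macro_f1:.3f}")
```

Macro-F1 averages the F1 scores of the two classes, so an evaluator cannot score well by simply predicting the majority label, which is why the label-balanced binary formulation pairs naturally with this metric.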