Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics

23 Aug 2024 | Weijia Zhang, Mohammad Aliannejadi, Yifei Yuan, Jiahuan Pei, Jia-Hong Huang, Evangelos Kanoulas
This paper investigates the effectiveness of faithfulness metrics in evaluating citation support in generated text, focusing on fine-grained support levels (full, partial, and no support). The authors propose a comparative evaluation framework that assesses how well faithfulness metrics align with human judgments across these three levels. The framework combines correlation analysis, classification evaluation, and retrieval evaluation to comprehensively measure the alignment between metric scores and human judgments. The results show that no single metric consistently excels across all evaluations, highlighting the complexity of fine-grained citation evaluation. The study also finds that similarity-based metrics outperform entailment-based metrics in retrieval evaluation, suggesting that entailment-based metrics are more sensitive to noisy data. The authors offer practical recommendations for developing more effective metrics: building training resources with fine-grained support-level annotations, introducing contrastive learning for robustness, and designing more explainable metrics. The study emphasizes the need for further research into citation evaluation, particularly in complex scenarios where supporting evidence may be distributed across multiple sources. The findings suggest that current faithfulness metrics struggle to distinguish fine-grained support levels and that more sophisticated approaches are needed to improve the accuracy and reliability of automated citation evaluation.
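To make the three alignment checks concrete, the snippet below is a minimal sketch (not the authors' implementation) of how they might be computed for a toy set of citation instances, assuming each instance has a continuous faithfulness score and a human label in {no, partial, full} support; the data, thresholds, and library choices are illustrative assumptions.

```python
# Minimal sketch of the three alignment checks described above; the thresholds,
# toy data, and library choices are illustrative assumptions, not the paper's protocol.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score

# Toy data: metric scores and human support labels (0 = no, 1 = partial, 2 = full).
scores = np.array([0.91, 0.15, 0.62, 0.78, 0.05, 0.55])
labels = np.array([2,    0,    1,    2,    0,    1])

# 1) Correlation analysis: do higher metric scores track higher support levels?
rho, _ = spearmanr(scores, labels)

# 2) Classification evaluation: bin scores into the three levels with assumed
#    thresholds and compare against human judgments.
preds = np.digitize(scores, bins=[0.33, 0.66])  # yields 0 / 1 / 2
macro_f1 = f1_score(labels, preds, average="macro")

# 3) Retrieval evaluation: rank instances by score and check whether fully
#    supported citations appear at the top (precision@k).
k = 3
top_k_labels = labels[np.argsort(-scores)[:k]]
precision_at_k = np.mean(top_k_labels == 2)

print(f"Spearman rho: {rho:.2f}, macro-F1: {macro_f1:.2f}, P@{k}: {precision_at_k:.2f}")
```

In this framing, correlation captures overall monotone agreement with human judgments, classification tests whether scores can be mapped onto the three support levels, and retrieval tests whether scores rank fully supported citations above unsupported ones.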