Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics

May-June 2003 | Chin-Yew Lin and Eduard Hovy
This paper presents an in-depth study of automatic evaluation methods for text summaries using n-gram co-occurrence statistics. The authors compare the effectiveness of n-gram co-occurrence-based scoring with human evaluations in the context of the Document Understanding Conference (DUC) 2002. They find that n-gram co-occurrence statistics, particularly unigram co-occurrence, correlate well with human evaluations, while the BLEU scoring method, though effective in machine translation, does not always yield good results for summaries.

The DUC 2002 evaluation involved two main summarization tasks: single-document and multi-document summarization. For each task, human summaries were created at different lengths, and automatic summaries were scored with various metrics. The authors evaluated the performance of different n-gram co-occurrence metrics, including unigram, bigram, trigram, and 4-gram, and found that unigram co-occurrence statistics outperformed the weighted average of variable-length n-grams in terms of correlation with human assessments.

The authors also examined the statistical significance of automatic evaluation metrics compared to human assessments. They found that unigram co-occurrence statistics provided good recall and precision in significance testing, indicating that they could effectively predict human assessments, whereas the weighted average of variable-length n-grams did not always yield good results. The study concludes that unigram co-occurrence statistics are a promising automatic scoring metric for summary evaluation.
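To make the core idea concrete, below is a minimal sketch of a recall-oriented unigram co-occurrence score of the kind the paper studies. It assumes lowercasing, whitespace tokenization, and a single reference summary; the function name and example strings are illustrative only, and the paper's actual setup may differ (for example, stemming and multiple reference summaries per document).

```python
from collections import Counter

def unigram_cooccurrence_recall(candidate: str, reference: str) -> float:
    """Recall-oriented unigram co-occurrence: the fraction of reference
    unigrams that also appear in the candidate summary (counts clipped
    by their frequency in the candidate)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each reference unigram can only be matched as often as it occurs
    # in the candidate.
    overlap = sum(min(count, cand_counts[token])
                  for token, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

# Hypothetical usage
reference = "the cat sat on the mat"
candidate = "a cat was sitting on the mat"
print(round(unigram_cooccurrence_recall(candidate, reference), 3))  # 0.667
```

By contrast, a BLEU-style score combines clipped n-gram precisions over several n-gram lengths (typically 1 through 4) with a brevity penalty; this weighted average over variable-length n-grams is the variant the study found less reliable for summary evaluation.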
The authors suggest that future research should explore other metrics, such as those based on information content (e.g., tf, tfidf, SVD), and consider using automatic question-answering tests for evaluation. They also propose an annual automatic evaluation track in DUC to encourage the development of new automated evaluation metrics.