This paper explores the use of n-gram co-occurrence statistics for automatically evaluating summaries, following the success of BLEU/NIST scoring in machine translation. The authors conduct an in-depth study using the Document Understanding Conference (DUC) 2002 data, which includes both single-document and multi-document summarization tasks. They find that unigram co-occurrence statistics correlate well with human evaluations, while the direct application of BLEU does not always yield good results. The study proposes two criteria for evaluating automatic evaluation metrics: correlation with human assessments and statistical significance prediction. Unigram and bigram co-occurrence statistics consistently outperform the weighted average of variable-length n-gram matches in terms of correlation and statistical significance.

The authors suggest that unigram co-occurrence statistics may be particularly effective because most systems in DUC generate summaries through sentence extraction. They also propose future directions, including exploring different metrics and integrating automatic evaluation into system development and evaluation processes.
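To make the central idea concrete, here is a minimal sketch of a recall-oriented unigram co-occurrence score, in the spirit of the metric the paper studies: the fraction of reference-summary unigrams that also appear in the candidate summary, with counts clipped so a repeated word cannot be matched more times than it occurs in the reference. The function name, tokenization, and exact formulation are illustrative assumptions, not the paper's precise definition.

```python
from collections import Counter

def unigram_cooccurrence(candidate: str, reference: str) -> float:
    """Recall-oriented unigram overlap: the fraction of reference
    unigrams also found in the candidate, with clipped counts.
    (Illustrative sketch; not the paper's exact formulation.)"""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's match count at its frequency in the candidate.
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

# Example: 5 of the 6 reference unigrams appear in the candidate.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(unigram_cooccurrence(candidate, reference))  # 0.8333...
```

A recall orientation fits the summarization setting described above: what matters is how much of the (human) reference content the system summary covers, which also helps explain why sentence-extraction systems score well under such a metric.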