Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics


2002 | George Doddington
Automatic evaluation of machine translation (MT) quality using n-gram co-occurrence statistics compares MT output with expert reference translations based on matches of short word sequences (n-grams). The technique was developed at IBM and is now used by NIST as the primary evaluation measure for MT research. The IBM algorithm, known as BLEU, scores a translation by combining the precisions of its matching n-grams of different lengths and applying a brevity penalty when the translation is significantly shorter than the reference length; the standard formulation is sketched below. Scoring is applied segment by segment, with text-conditioning steps to improve scoring efficacy.

N-gram co-occurrence scores correlate strongly with human assessments of translation quality, although the correlations are generally higher for machine translations than for professional human translations.
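The summary refers to the BLEU formula without reproducing it. As commonly written (this is the standard IBM/BLEU formulation with weights w_n over n-gram orders up to N, typically uniform and N = 4, rather than a verbatim copy of the paper's equation):

    BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n ),   with   BP = min(1, exp(1 − L_ref / L_sys))

Here p_n is the clipped (modified) precision of the system's n-grams of length n against the references, L_sys is the length of the system translation, and L_ref the reference length; the brevity penalty BP reduces the score only when the output is shorter than the references.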
The technique has been validated on several corpora, including the 1994 and 2001 MT evaluations, where BLEU scores correlated highly with human judgments, with the exception of fluency judgments for Japanese. Score stability was also assessed using F-ratios (the ratio of score variance between systems to score variance within a system across data subsets), with higher F-ratios indicating more reliable scores.

NIST has modified the IBM formulation to improve score stability and reliability. The NIST score weights each matching n-gram by its information, so that rarer n-grams contribute more, and uses a modified brevity penalty that minimizes the impact of small variations in translation length; a sketch of this information-weighted scoring is given at the end of this summary. Compared with IBM's BLEU score, the NIST score shows improved stability and reliability across different corpora.

The performance of the NIST scoring algorithm was analyzed with respect to source data, number of reference translations, segment size, and the amount of language training data. Correlation and F-ratio improved somewhat with more references or larger segment sizes, although the effect of adding references was modest. Using more language training data did not significantly improve performance, and higher-order n-grams often contributed little to the score. Preserving case information did not significantly affect scoring performance, and normalizing the references changed neither the F-ratio nor the correlation with human assessments.

NIST now provides an evaluation facility that supports MT research, including an n-gram co-occurrence scoring utility and an email-based automatic evaluation service. This facility allows researchers to evaluate translations from various languages into English using a standardized method.
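As a rough illustration of the information-weighted scoring described above, the following self-contained Python sketch estimates per-n-gram information weights from the reference text, accumulates the information gained by matched n-grams, and applies a length factor. It is not the official NIST mteval utility: the function names are made up for this sketch, text conditioning and test-set aggregation are omitted, and the brevity-factor calibration follows the paper's description only approximately.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, n):
    """All contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def info_weights(references, max_n=5):
    """Information weight of each n-gram, estimated from the reference text:
    info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)).
    For unigrams the denominator is the total number of reference words."""
    counts = Counter()
    for ref in references:
        for n in range(1, max_n + 1):
            counts.update(ngrams(ref, n))
    total_words = sum(len(ref) for ref in references)
    info = {}
    for gram, c in counts.items():
        denom = total_words if len(gram) == 1 else counts[gram[:-1]]
        info[gram] = log(denom / c, 2)
    return info


def nist_like_score(system, references, info, max_n=5):
    """Hedged sketch of an information-weighted co-occurrence score with a
    brevity factor; rarer n-grams contribute more to the score."""
    if not system or not references:
        return 0.0
    score = 0.0
    for n in range(1, max_n + 1):
        sys_ngrams = ngrams(system, n)
        if not sys_ngrams:
            continue
        ref_counts = Counter()
        for ref in references:
            ref_counts |= Counter(ngrams(ref, n))    # max count over references
        matched = Counter(sys_ngrams) & ref_counts   # clip matches to reference counts
        gained = sum(info.get(g, 0.0) * c for g, c in matched.items())
        score += gained / len(sys_ngrams)            # normalize by system n-gram count
    # Brevity factor: 1.0 at full length and (per the paper's description)
    # calibrated to 0.5 when the output is 2/3 of the average reference length.
    avg_ref_len = sum(len(r) for r in references) / len(references)
    ratio = min(len(system) / avg_ref_len, 1.0)
    beta = log(0.5) / log(2.0 / 3.0) ** 2
    return score * exp(beta * log(ratio) ** 2)
```

A call like `nist_like_score(hyp.split(), refs_tok, info_weights(refs_tok))`, with `refs_tok` a list of tokenized references, scores a single hypothesis; the actual metric aggregates counts over an entire test set, applies its own text conditioning, and estimates information weights from all of the reference data.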