| George Doddington, doddington@nist.gov, 925/377-5883
The chapter discusses the automatic evaluation of machine translation (MT) quality using N-gram co-occurrence statistics, a technique introduced by IBM in 2001. The method scores MT output by counting the word N-grams it shares with a set of expert reference translations. IBM's "BLEU" score, which applies a brevity penalty to output that is shorter than the references, has shown strong correlation with human judgments of translation quality. NIST was commissioned to develop an MT evaluation facility based on this work, and the resulting utility is now available for TIDES MT research.
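For concreteness, the BLEU formulation (Papineni et al., 2001) can be written as

$$
\mathrm{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}
$$

where $p_n$ is the modified (clipped) precision of word $n$-grams, $w_n = 1/N$ with $N = 4$, $c$ is the candidate length, and $r$ the effective reference length.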
The chapter details the N-gram co-occurrence scoring process, including the conditioning of translated text to improve scoring accuracy. It also evaluates the stability and reliability of N-gram scores through experiments that vary the number of reference translations, the segment size, and the source language. The NIST score formulation, which includes an information-weighted N-gram count and a modified brevity penalty, is introduced and compared with IBM's BLEU score. The NIST score is found to provide better stability and reliability, especially against human judgments of Adequacy.
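In the paper's formulation, the information weight of an N-gram is estimated from counts over the reference corpus,

$$
\mathrm{Info}(w_1 \ldots w_n) = \log_2\!\left(\frac{\#\,\text{occurrences of } w_1 \ldots w_{n-1}}{\#\,\text{occurrences of } w_1 \ldots w_n}\right),
$$

and the score sums information-weighted co-occurrences, normalized by the number of system-output N-grams, times a modified brevity penalty:

$$
\text{Score} = \sum_{n=1}^{N}
\frac{\displaystyle\sum_{\text{co-occurring } w_1 \ldots w_n} \mathrm{Info}(w_1 \ldots w_n)}
{\displaystyle\sum_{w_1 \ldots w_n \,\in\, \text{sys output}} 1}
\cdot \exp\!\left\{ \beta \,\log^2\!\left[\min\!\left(\frac{L_{sys}}{\bar{L}_{ref}},\, 1\right)\right] \right\}
$$

with $N = 5$ and $\beta$ chosen so the brevity factor equals 0.5 when the system output contains two-thirds as many words as the average reference translation.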
The chapter also explores the impact of different parameters on the performance of the NIST scoring algorithm, including source language, number of references, segment size, and case information preservation. Finally, it describes the NIST MT Evaluation Facility, which includes an N-gram co-occurrence scoring utility and an email-based automatic evaluation utility for technology evaluations.
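To make the scoring mechanics concrete, below is a minimal Python sketch of information-weighted N-gram scoring with the modified brevity penalty given above. It is an illustration of the formulas, not the NIST utility itself; all function and variable names are invented for this example, and it scores a single segment rather than a full document.

```python
from collections import Counter
from math import exp, log, log2


def ngrams(tokens, n):
    """All word N-grams of order n in a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def info_weights(references, max_n=5):
    """N-gram information values estimated from the reference corpus:
    Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn))."""
    counts = Counter()
    for ref in references:
        for n in range(1, max_n + 1):
            counts.update(ngrams(ref, n))
    total_words = sum(len(ref) for ref in references)
    info = {}
    for gram, c in counts.items():
        # The context of a unigram is the empty string, whose "count"
        # is the total number of words in the reference corpus.
        context = counts[gram[:-1]] if len(gram) > 1 else total_words
        info[gram] = log2(context / c)
    return info


def nist_score(candidate, references, max_n=5):
    """Illustrative per-segment NIST-style score: information-weighted
    N-gram co-occurrence times a modified brevity penalty."""
    info = info_weights(references, max_n)
    score = 0.0
    for n in range(1, max_n + 1):
        cand_grams = Counter(ngrams(candidate, n))
        total = sum(cand_grams.values())
        if total == 0:
            continue
        # Clip each N-gram's credit at its maximum count in any one reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        gained = sum(min(c, max_ref[g]) * info.get(g, 0.0)
                     for g, c in cand_grams.items())
        score += gained / total
    # Modified brevity penalty: exp(beta * log^2(min(L_sys / L_ref, 1))),
    # with beta set so the factor is 0.5 at a 2/3 length ratio.
    beta = log(0.5) / log(2 / 3) ** 2
    avg_ref_len = sum(len(r) for r in references) / len(references)
    ratio = min(len(candidate) / avg_ref_len, 1.0)
    return score * exp(beta * log(ratio) ** 2)
```

For example, `nist_score("the cat sat on the mat".split(), [r.split() for r in ["the cat sat on the mat", "a cat was sitting on the mat"]])` yields a higher value than a shorter or lexically mismatched candidate, since both the co-occurrence term and the brevity factor drop.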