BLEU: a Method for Automatic Evaluation of Machine Translation

July 2002 | Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu
BLEU is an automatic evaluation method for machine translation (MT) that is quick, inexpensive, and language-independent. It correlates highly with human judgment and has little marginal cost per run, so it can stand in for human judges whenever quick or frequent evaluations are needed. The central idea, inspired by the word error rate metric used in speech recognition, is to measure how close a candidate translation is to one or more human reference translations using a weighted average of variable-length n-gram matches.

The metric rests on modified n-gram precision: the count of each candidate n-gram is clipped at the maximum number of times it occurs in any single reference, which penalizes translations that overuse words from the references. Because precision alone would reward overly short outputs, an exponential brevity penalty is applied when the candidate is shorter than the references. BLEU is the geometric mean of the modified n-gram precisions multiplied by this brevity penalty, and it ranges from 0 to 1, with higher scores indicating closer translations; the formula and a small code sketch appear below.
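In the paper's notation, with clipped precisions p_n up to maximum order N (the baseline uses N = 4 with uniform weights w_n = 1/N), candidate length c, and effective reference length r, the score is:

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\[2pt]
e^{\,1 - r/c} & \text{if } c \le r,
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right).
```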
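A minimal sentence-level sketch of this computation in Python follows. The paper itself accumulates n-gram statistics over a whole test corpus before combining them, and the names here (ngrams, modified_precision, bleu) are illustrative rather than taken from any reference implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the most favorable single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Geometric mean of clipped precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # log(0) is undefined; score is 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Effective reference length: the reference closest in length.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_avg)

cand = "the cat sat on the mat today".split()
refs = ["the cat sat on the mat".split(),
        "there is a cat on the mat".split()]
print(round(bleu(cand, refs), 4))   # ~0.81 for this toy example
```

Clipping is the key design choice: without it, a degenerate candidate such as "the the the the" would receive perfect unigram precision against any reference containing "the".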
BLEU was validated against human judgments. On a test corpus of 500 sentences, it reliably distinguished high-quality from low-quality translations, and its scores correlated well with evaluations by both monolingual and bilingual human judges.

BLEU thus provides a quick, reliable measure of translation quality. It is particularly useful for large-scale evaluations, can help researchers identify effective modeling ideas, and applies to other natural language generation tasks such as summarization. Shown to be effective across multiple languages, it is a promising tool for advancing MT research and development.