August 2006 | Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul
This paper introduces Human-Targeted Translation Edit Rate (HTER), a new measure of machine translation (MT) quality that is shown to correlate better with human judgments than existing metrics such as BLEU and METEOR. The paper compares HTER with the standard Translation Edit Rate (TER) and with human-targeted variants of other metrics (HBLEU and HMETEOR). HTER is defined as the minimum number of edits needed to transform a system hypothesis into a fluent sentence with the same meaning, measured against a human-targeted reference: a reference produced by an annotator who edits the hypothesis until it is fluent and semantically equivalent to the untargeted references. The paper also presents an algorithm for counting the edits required to align the hypothesis with the reference, and discusses the advantages of human-targeted references over traditional, untargeted ones.

The results show that HTER with a single targeted reference reduces the edit rate by about 33% relative to TER with four untargeted references, and that HTER correlates more strongly with human judgments than the other metrics, including BLEU even when BLEU is given human-targeted references. The study also finds that HTER is less sensitive to the number of references than BLEU, and that the standard deviation of scores decreases when targeted references are used. The paper concludes that HTER is a promising alternative to subjective human judgments for evaluating MT quality.
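At its core, TER is an edit rate. Using the definition above, it can be written as

\text{TER} = \frac{\text{number of edits}}{\text{average number of reference words}}

where an edit is a word insertion, deletion, substitution, or a shift of a contiguous word sequence (a shift counts as one edit). HTER applies the same computation against the single human-targeted reference.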
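To make the edit-rate idea concrete, here is a minimal, illustrative sketch in Python. It counts only insertions, deletions, and substitutions via word-level Levenshtein distance and normalizes by reference length; the paper's actual TER algorithm additionally searches greedily for block shifts, which this sketch omits, and the function names here are invented for illustration.

# Minimal TER-style edit-rate sketch (assumption: block shifts omitted;
# real TER also counts shifts of contiguous word sequences as single edits).

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance: insertions, deletions, substitutions."""
    m, n = len(hyp), len(ref)
    # dp[i][j]: minimum edits turning the first i hypothesis words
    # into the first j reference words.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all remaining hypothesis words
    for j in range(n + 1):
        dp[0][j] = j          # insert all remaining reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match or substitution
    return dp[m][n]

def simplified_ter(hypothesis, reference):
    """Edit distance normalized by reference length (shift-free TER sketch)."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / len(ref)

# One insertion is needed, so the score is 1 / 6 ≈ 0.167.
print(simplified_ter("the cat sat on mat", "the cat sat on the mat"))

With multiple untargeted references, TER takes the fewest edits against any single reference; HTER instead scores against the one targeted reference, which is why it needs far fewer edits in the paper's experiments.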