CHRF: character n-gram F-score for automatic MT evaluation

17-18 September 2015 | Maja Popović
This paper proposes the character n-gram F-score (CHRF) for automatic evaluation of machine translation (MT) output. Character n-grams have previously appeared inside more complex metrics, but their potential as a standalone measure had not been explored.

The CHRF score combines character n-gram precision (CHRP) and recall (CHRR):

CHRFβ = (1 + β²) · (CHRP · CHRR) / (β² · CHRP + CHRR)

where β is a parameter that weights recall β times as much as precision. Experiments with different n-gram lengths showed that 6-grams yield the best correlations, and among the β values tested, β = 3 (CHRF3) performed best on WMT14 data.

The authors report system-level correlations with human rankings for CHRF (6-gram F1-score) on WMT12, WMT13, and WMT14 data, as well as segment-level correlations, measured with Kendall's τ rank correlation coefficient, for CHRF (6-gram F1) and CHRF3 (6-gram F3) on WMT14 data for all available target languages. Compared with standard metrics (BLEU, TER, METEOR, WORDF), CHRF3 outperformed them in most cases; for translation into English it achieved the highest average segment-level correlation, outperforming even the best metrics from the WMT14 shared evaluation task.

The authors conclude that CHRF is a promising metric for MT evaluation: it is language- and tokenization-independent and correlates well with human judgments at both the system and segment levels. They note that further research is needed to explore other β values and n-gram weights, and to apply CHRF to more languages, including those with different writing systems.
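The F-score formula above can be sketched in code. The following is a minimal, simplified illustration, not the official scorer: it assumes spaces are stripped before n-gram extraction and that per-n precisions and recalls are uniformly averaged over n = 1..6 before applying the F-beta combination; the function and variable names are my own.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams; spaces are removed first (an assumption)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified CHRF sketch: average n-gram precision/recall over
    n = 1..max_n, then combine via the F-beta formula from the paper."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if sum(hyp.values()) > 0:
            precisions.append(overlap / sum(hyp.values()))
        if sum(ref.values()) > 0:
            recalls.append(overlap / sum(ref.values()))
    chrp = sum(precisions) / len(precisions) if precisions else 0.0
    chrr = sum(recalls) / len(recalls) if recalls else 0.0
    if chrp + chrr == 0.0:
        return 0.0
    # CHRFβ = (1 + β²) · CHRP · CHRR / (β² · CHRP + CHRR)
    return (1 + beta**2) * chrp * chrr / (beta**2 * chrp + chrr)
```

With the default beta=3.0, recall dominates the score, matching the paper's finding that CHRF3 correlates best with human segment-level rankings; a perfect match yields 1.0.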