CHRF: character n-gram F-score for automatic MT evaluation

17-18 September 2015 | Maja Popović
The paper introduces the character $n$-gram F-score (CHRF) as a new metric for automatic evaluation of machine translation (MT) output. Character $n$-grams have been used within more complex metrics such as MTERRATER and BEER, but their individual potential as a standalone metric had not been investigated. The CHRF score is computed from character $n$-gram precision and recall and is designed to be simple, language-independent, and tokenization-independent.

The study reports system-level correlations with human rankings for the 6-gram F1-score (CHRF) on WMT12, WMT13, and WMT14 data, and segment-level correlations for the 6-gram F1-score (CHRF) and F3-score (CHRF3) on WMT14 data for all available target languages. The results show that CHRF3, which gives more weight to recall, achieves the best segment-level correlations, outperforming metrics such as BLEU, TER, and METEOR, especially for translations out of English. The paper concludes that CHRF is a promising metric for automatic MT evaluation; future work will explore different $\beta$ values and $n$-gram weights, as well as application to more languages with different writing systems.
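The metric described above can be sketched in a few lines: average character $n$-gram precision and recall over $n = 1 \ldots 6$, then combine them as an $F_\beta$-score, where $\beta = 1$ gives CHRF and $\beta = 3$ gives CHRF3. This is a simplified illustration, not the authors' implementation; in particular, the handling of sentences too short to contain higher-order $n$-grams is an assumption here (such $n$-gram orders simply contribute zero).

```python
from collections import Counter

def char_ngrams(text, n):
    # Collect character n-grams; spaces are removed, reflecting
    # the metric's tokenization independence.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=1.0):
    """Simplified character n-gram F-beta score (CHRF for beta=1, CHRF3 for beta=3)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        # max(..., 1) guards against division by zero for very short strings;
        # a simplifying assumption, not taken from the paper.
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / max_n  # uniform weights over n-gram orders
    r = sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

For example, `chrf(hyp, ref, beta=3.0)` yields the recall-weighted CHRF3 variant; an identical hypothesis and reference score 1.0, and any mismatch lowers the score toward 0.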