Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies

10 Jun 2024 | Tom Kocmi, Vilém Zouhar, Christian Federmann, Matt Post
This paper addresses the challenge of reconciling the "dynamic range" of modern metrics in machine translation (MT) research, where no single metric dominates. The authors investigate the "metric delta," the score difference that humans can actually notice, using the large ToShip23 dataset of human judgments to measure pairwise system accuracy. They find that different metrics have different dynamic ranges and that metrics differ in how reliably they track human judgment. The study introduces a method to establish delta-accuracy, which is more stable than statistical p-values, and explores how factors such as testset size, dataset and domain selection, and translation direction affect metric deltas. The results show that accuracy thresholds vary across metrics, with BLEU reaching lower accuracy than newer metrics such as CometKiwi22QE, and that BLEU is unreliable for comparing unrelated systems. The paper concludes with recommendations for MT evaluation, emphasizing the use of CometKiwi22QE as the primary metric and reporting estimated accuracy alongside significance testing and metric delta.
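
The central statistic behind these findings is pairwise system accuracy: the fraction of system pairs for which the sign of a metric's score delta agrees with the sign of the human-judgment delta. The sketch below illustrates the idea, including a simplified reading of delta-accuracy that restricts attention to pairs whose metric delta exceeds a threshold; the function names and the scores are hypothetical and not taken from the paper or from ToShip23.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs where the metric delta has the same
    sign as the human-judgment delta (ties count as disagreement)."""
    agree, total = 0, 0
    for a, b in combinations(metric_scores, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        if metric_delta * human_delta > 0:  # same sign -> agreement
            agree += 1
        total += 1
    return agree / total

def delta_accuracy(metric_scores, human_scores, min_delta):
    """Pairwise accuracy restricted to pairs whose metric delta is at
    least min_delta -- a simplified sketch of the paper's delta-accuracy."""
    agree, total = 0, 0
    for a, b in combinations(metric_scores, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        if abs(metric_delta) < min_delta:
            continue  # delta too small for humans to notice; skip the pair
        human_delta = human_scores[a] - human_scores[b]
        if metric_delta * human_delta > 0:
            agree += 1
        total += 1
    return agree / total if total else float("nan")

# Hypothetical scores for three MT systems (not real evaluation data).
metric = {"sysA": 28.4, "sysB": 27.1, "sysC": 29.0}  # e.g. BLEU
human = {"sysA": 0.12, "sysB": 0.05, "sysC": 0.20}   # human preference
print(f"pairwise accuracy: {pairwise_accuracy(metric, human):.2f}")
print(f"delta-accuracy (min_delta=1.0): {delta_accuracy(metric, human, 1.0):.2f}")
```

In this framing, raising min_delta trades coverage for reliability: fewer system pairs qualify, but the metric's verdict on those pairs is more likely to match human judgment, which is how a score difference can be mapped to an estimated accuracy.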