Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies

10 Jun 2024 | Tom Kocmi, Vilém Zouhar, Christian Federmann, Matt Post
This paper addresses the challenge of reconciling the "dynamic range" of modern metrics in machine translation (MT) research, where no single metric dominates. The authors investigate the "metric delta," the score difference that humans can actually notice, using the large ToShip23 dataset of human judgments to measure pairwise system accuracy. They find that different metrics have different dynamic ranges and that metrics differ in how reliably they track human judgment. The study introduces a method to establish delta-accuracy, which is more stable than statistical p-values, and explores how factors such as testset size, dataset and domain selection, and translation direction affect metric deltas. The results show that accuracy thresholds vary across metrics, with BLEU reaching lower accuracy than newer metrics such as CometKiwi22QE, and that BLEU is unreliable for comparing unrelated systems. The paper concludes with recommendations for MT evaluation, emphasizing the use of CometKiwi22QE as the primary metric and reporting estimated accuracy alongside significance testing and metric delta.
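
The central statistic behind these findings is pairwise system accuracy: the fraction of system pairs for which the sign of a metric's score delta agrees with the sign of the human-judgment delta. The sketch below illustrates the idea, including a simplified reading of delta-accuracy that restricts attention to pairs whose metric delta exceeds a threshold; the function names and the scores are hypothetical and not taken from the paper or from ToShip23.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs where the metric delta has the same
    sign as the human-judgment delta (ties count as disagreement)."""
    agree, total = 0, 0
    for a, b in combinations(metric_scores, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        if metric_delta * human_delta > 0:  # same sign -> agreement
            agree += 1
        total += 1
    return agree / total

def delta_accuracy(metric_scores, human_scores, min_delta):
    """Pairwise accuracy restricted to pairs whose metric delta is at
    least min_delta -- a simplified sketch of the paper's delta-accuracy."""
    agree, total = 0, 0
    for a, b in combinations(metric_scores, 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        if abs(metric_delta) < min_delta:
            continue  # delta too small for humans to notice; skip the pair
        human_delta = human_scores[a] - human_scores[b]
        if metric_delta * human_delta > 0:
            agree += 1
        total += 1
    return agree / total if total else float("nan")

# Hypothetical scores for three MT systems (not real evaluation data).
metric = {"sysA": 28.4, "sysB": 27.1, "sysC": 29.0}  # e.g. BLEU
human = {"sysA": 0.12, "sysB": 0.05, "sysC": 0.20}   # human preference
print(f"pairwise accuracy: {pairwise_accuracy(metric, human):.2f}")
print(f"delta-accuracy (min_delta=1.0): {delta_accuracy(metric, human, 1.0):.2f}")
```

In this framing, raising min_delta trades coverage for reliability: fewer system pairs qualify, but the metric's verdict on those pairs is more likely to match human judgment, which is how a score difference can be mapped to an estimated accuracy.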