The paper "A Call for Clarity in Reporting BLEU Scores" by Matt Post addresses the issue of inconsistent reporting of BLEU scores in machine translation research. BLEU, a dominant metric for evaluating translation quality, is parameterized and can vary significantly with different configurations. The main problem is that the reference preprocessing schemes used by researchers are often not reported or are hard to find, making it difficult to compare scores across papers. Post quantifies this variation, finding differences as high as 1.8 between commonly used configurations. The primary cause of this incompatibility is user-supplied reference tokenization. To address this, Post suggests that researchers use the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing. He also introduces SACREBLEU, a Python tool that automatically downloads and stores references for common test sets, facilitates consistent reference tokenization, and provides a version string to document the parameters used. This tool aims to improve the comparability and reproducibility of BLEU scores in machine translation research.The paper "A Call for Clarity in Reporting BLEU Scores" by Matt Post addresses the issue of inconsistent reporting of BLEU scores in machine translation research. BLEU, a dominant metric for evaluating translation quality, is parameterized and can vary significantly with different configurations. The main problem is that the reference preprocessing schemes used by researchers are often not reported or are hard to find, making it difficult to compare scores across papers. Post quantifies this variation, finding differences as high as 1.8 between commonly used configurations. The primary cause of this incompatibility is user-supplied reference tokenization. To address this, Post suggests that researchers use the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing. He also introduces SACREBLEU, a Python tool that automatically downloads and stores references for common test sets, facilitates consistent reference tokenization, and provides a version string to document the parameters used. This tool aims to improve the comparability and reproducibility of BLEU scores in machine translation research.