The paper "A Call for Clarity in Reporting BLEU Scores" by Matt Post addresses the issue of inconsistent reporting of BLEU scores in machine translation research. BLEU, a dominant metric for evaluating translation quality, is parameterized and can vary significantly with different configurations. The main problem is that the reference preprocessing schemes used by researchers are often not reported or are hard to find, making it difficult to compare scores across papers. Post quantifies this variation, finding differences as high as 1.8 between commonly used configurations. The primary cause of this incompatibility is user-supplied reference tokenization. To address this, Post suggests that researchers use the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing. He also introduces SACREBLEU, a Python tool that automatically downloads and stores references for common test sets, facilitates consistent reference tokenization, and provides a version string to document the parameters used. This tool aims to improve the comparability and reproducibility of BLEU scores in machine translation research.The paper "A Call for Clarity in Reporting BLEU Scores" by Matt Post addresses the issue of inconsistent reporting of BLEU scores in machine translation research. BLEU, a dominant metric for evaluating translation quality, is parameterized and can vary significantly with different configurations. The main problem is that the reference preprocessing schemes used by researchers are often not reported or are hard to find, making it difficult to compare scores across papers. Post quantifies this variation, finding differences as high as 1.8 between commonly used configurations. The primary cause of this incompatibility is user-supplied reference tokenization. To address this, Post suggests that researchers use the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing. He also introduces SACREBLEU, a Python tool that automatically downloads and stores references for common test sets, facilitates consistent reference tokenization, and provides a version string to document the parameters used. This tool aims to improve the comparability and reproducibility of BLEU scores in machine translation research.