The paper "Statistical Significance Tests for Machine Translation Evaluation" by Philipp Koehn discusses the importance of statistical methods in assessing the true quality of machine translation systems, particularly focusing on the BLEU score. The author introduces bootstrap resampling methods to compute the statistical significance of test results, validating these methods through experiments on the BLEU score. Even with small test sets of only 300 sentences, the methods can provide reliable assurances that observed score differences are real. The paper also explores the properties of the BLEU metric, such as its reliance on higher n-grams and the brevity penalty, and provides empirical evidence that the estimated significance levels are accurate. The author emphasizes the need for a trusted experimental framework to draw valid conclusions about system improvements and highlights the importance of assembling representative test sets from different parts of a larger corpus. The paper concludes by advocating for the use of statistical significance tests in published machine translation research.
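
As a rough illustration of how bootstrap resampling can be used to compare two systems on the same test set, the following Python sketch repeatedly redraws the test set with replacement and counts how often one system outscores the other. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`corpus_precision`, `paired_bootstrap`) are hypothetical, the metric is a toy corpus-level word precision standing in for BLEU, and the default of 1000 resamples is an illustrative choice.

```python
import random


def corpus_precision(stats):
    """Toy corpus-level metric (a stand-in for BLEU, assumed for illustration).

    `stats` is a list of (matched_words, output_words) pairs, one per sentence.
    The score is computed from corpus totals, not as a mean of sentence scores.
    """
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total if total else 0.0


def paired_bootstrap(stats_a, stats_b, metric=corpus_precision,
                     n_samples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A beats B.

    Both systems must be scored on the same sentences, in the same order,
    so that each resample draws the same sentence indices for A and for B.
    """
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must cover the same test sentences")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(stats_a)
    wins_a = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement: one bootstrap test set.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([stats_a[i] for i in idx])
        score_b = metric([stats_b[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples
```

Calling `paired_bootstrap(stats_a, stats_b)` on per-sentence statistics for a 300-sentence test set yields a value between 0 and 1. Under this sketch, a value of 0.95 or higher would be read as A being better than B at roughly the 95% level; that threshold is a conventional choice assumed here, not a figure taken from the summary above.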