This paper addresses the issue of accurately computing word probabilities using language models (LMs), which are typically trained on subword units rather than individual words. While many linguistic studies rely on LM outputs to estimate word probabilities, they often fail to account for the complexities introduced by subword tokenization, particularly when using beginning-of-word (bow)-marking tokenizers. The authors demonstrate that this oversight leads to incorrect probability estimates, which can significantly affect results in studies of sentence comprehension and lexical efficiency.
The paper derives the correct procedure for computing word probabilities from the relationship between subword-level and word-level distributions. It shows that with bow-marking tokenizers, a word's probability in context is not simply the product of its subwords' probabilities: because the end of a word is only signalled by the beginning-of-word marker on the following token, the computation must be adjusted to account for where word boundaries fall in the subword sequence. This correction is essential for accurate probability estimation and has direct consequences for empirical studies that rely on LM outputs.
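To make the correction concrete, here is a minimal sketch, not the authors' reference implementation, of how it could be applied with a Hugging Face causal LM and GPT-2's bow-style (leading-space) tokenizer. The specific adjustment implemented, rescaling the naive product of subword probabilities by the probability mass of word-initial tokens (or EOS) after the word, divided by that mass given the context alone, is one reading of the boundary correction described above; the model choice, helper names, and the assumption that the context's tokenization is a prefix of the full sequence's are illustrative, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Word-initial tokens: GPT-2's byte-level BPE marks a word beginning with a
# leading-space symbol ("\u0120", printed as "Ġ"); seeing one of these next,
# or EOS, means the current word has ended.
bow_ids = torch.tensor(
    [i for s, i in tok.get_vocab().items() if s.startswith("\u0120")]
    + [tok.eos_token_id]
)

def log_end_mass(ids):
    """Log-probability that the token after `ids` begins a new word (or is EOS)."""
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0, -1]
    return torch.logsumexp(torch.log_softmax(logits, dim=-1)[bow_ids], dim=0)

def word_logprob(context, word):
    ctx_ids = torch.tensor(tok.encode(context))
    full_ids = torch.tensor(tok.encode(context + " " + word))
    # Assumes the context tokenization is a prefix of the full tokenization,
    # which holds here because the word is separated from it by a space.
    word_ids = full_ids[len(ctx_ids):]

    # Naive term: sum of the word's subword log-probabilities in context.
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids.unsqueeze(0)).logits[0], dim=-1)
    naive = sum(
        logprobs[len(ctx_ids) - 1 + i, word_ids[i]] for i in range(len(word_ids))
    )

    # Boundary correction: add (in log space) the probability that the word has
    # actually ended, and subtract the analogous term for the bare context,
    # which was already counted in the previous word's probability.
    return float(naive + log_end_mass(full_ids) - log_end_mass(ctx_ids))

print(word_logprob("The cat sat on the", "mat"))
```

In log space the correction reduces to adding two extra terms to the naive sum; the last-position distribution over the full sequence is recomputed here for clarity, but it could be reused from the logits already computed for the naive term.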
The authors also discuss the theoretical and practical implications of these findings. They highlight that the choice of tokenizer determines how word probabilities must be computed, and that many widely used LMs employ bow-marking tokenizers, so the inaccuracy is widespread in practice. Empirical evaluations show that correcting the computation yields statistically significant differences in results, even though the overall conclusions of previous studies remain unchanged.
The paper concludes that precise computational methods are crucial for linguistic research and that future work should adopt these corrections to improve the reliability of empirical analyses. The authors also note limitations of their approach, including corner cases that remain to be handled and the open question of how well their methods apply to non-autoregressive models.