How to Compute the Probability of a Word

20 Jun 2024 | Tiago Pimentel, Clara Meister
This paper addresses the accurate computation of word probabilities using language models (LMs), a quantity central to estimating perplexity and surprisal in linguistics research. While modern LMs operate over subwords, computing a word's probability from subword probabilities is subtler than it appears, particularly with beginning-of-word (BOW)-marking tokenizers such as those used in GPT models: because these tokenizers mark where words begin rather than where they end, a word's probability also depends on whether the next token starts a new word. The authors derive methods to compute word probabilities correctly, highlighting the errors introduced by the common naive computation under BOW-marking tokenizers. Empirically, they show that correcting the computation changes the outcomes of sentence comprehension and lexical optimization analyses, demonstrating the importance of precise computational methods in linguistic research. The findings suggest that future studies should adopt these corrections to improve the reliability of their analyses.
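Concretely, the kind of end-of-word correction the paper derives can be illustrated with a short script. The sketch below is not the authors' implementation; the model choice (GPT-2 via Hugging Face transformers) and the function name word_log_prob are illustrative assumptions. It multiplies the naive sum of subword log-probabilities by the probability that the next token is BOW-marked (or end-of-sequence), signaling that the word has ended, and divides by the same boundary probability given the context alone, since conditioning on the context already implies a word begins at that position.

```python
# Minimal sketch of the end-of-word correction for a GPT-2-style
# BOW-marking tokenizer, whose word-initial tokens start with "Ġ"
# (the leading-space symbol). Illustrative, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Token ids that begin a new word (BOW-marked), plus EOS: seeing one
# of these *after* a word is what signals the word has ended.
bow_ids = torch.tensor(
    [i for tok, i in tokenizer.get_vocab().items() if tok.startswith("Ġ")]
    + [tokenizer.eos_token_id]
)

def word_log_prob(context: str, word: str) -> float:
    """log p(word | context) for a mid-sentence word, with the
    end-of-word correction for BOW-marking tokenizers."""
    ctx_ids = tokenizer.encode(context)
    # Mid-sentence, so the word's first subword carries the BOW marker.
    word_ids = tokenizer.encode(" " + word)
    ids = torch.tensor([ctx_ids + word_ids])

    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)

    # Naive term: sum of the word's subword log-probabilities.
    n_ctx = len(ctx_ids)
    naive = sum(
        log_probs[0, n_ctx + i - 1, word_ids[i]].item()
        for i in range(len(word_ids))
    )

    # p(boundary | context + word): the next token is BOW-marked or EOS.
    log_end = torch.logsumexp(log_probs[0, -1, bow_ids], dim=-1).item()
    # p(boundary | context): divided out, because the context already
    # implies that a word starts at this position.
    log_start = torch.logsumexp(log_probs[0, n_ctx - 1, bow_ids], dim=-1).item()

    return naive + log_end - log_start

# Example: log p("cat" | "The"), e.g. as the second word of "The cat sat".
print(word_log_prob("The", "cat"))
```

Note that a sentence-initial word needs slightly different bookkeeping, since GPT-2 tokenizes the first word without a leading space; the sketch covers only the mid-sentence case.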