Toward a Theory of Tokenization in LLMs

April 15, 2024 | Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran
This paper investigates the role of tokenization in large language models (LLMs) by studying the behavior of transformers on simple data-generating processes. While a large body of research attempts to bypass tokenization for language modeling, the consensus is that tokenization remains a necessary first step in designing state-of-the-art LLMs.

The authors analyze transformers trained on k-th order Markov processes and show that, without tokenization, transformers fail to learn the correct distribution and instead predict characters according to a unigram model. With tokenization, by contrast, transformers model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. In fact, even the simplest unigram models learned over tokens can model the probability of sequences drawn from k-th order Markov sources near-optimally, provided the tokenizer is chosen appropriately.

To make this precise, the authors analyze a toy tokenizer that adds all length-k sequences to the dictionary and show that, as the dictionary grows, unigram models trained on the resulting tokens become better at modeling the probabilities of sequences drawn from Markov sources. They further prove that tokenizers used in practice, such as the LZW tokenizer and a variant of the BPE tokenizer, satisfy the same property while requiring much smaller dictionaries to reach any target cross-entropy loss. The paper also studies the generalization ability of tokenizers, exhibiting tokenizers that generalize poorly and others that generalize well. The authors conclude that tokenization is essential for LLMs to achieve near-optimal performance on Markovian data.
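As a rough illustration of the gap described above, the sketch below (Python, not the authors' code) samples a toy order-1 binary Markov source and compares the bits-per-character cost of a unigram model fit on raw characters, on tokens from the toy all-length-k dictionary, and on tokens from an LZW-style dictionary of the same size. The switching probability p = 0.2, the block length k = 8, the 2^k-token budget, the greedy longest-match segmentation, and all function names are assumptions made for this sketch, not constructions taken verbatim from the paper.

```python
import math
import random

# Toy illustration only. Order-1 binary Markov source: the next character
# flips with probability P_SWITCH and repeats otherwise. P_SWITCH, k, the
# dictionary budget, and the sequence lengths are assumptions made for this
# sketch, not values taken from the paper.
P_SWITCH = 0.2
random.seed(0)

def sample_markov(n, p=P_SWITCH):
    """Draw n characters (0/1) from the order-1 Markov source."""
    seq = [random.randint(0, 1)]
    for _ in range(n - 1):
        seq.append(1 - seq[-1] if random.random() < p else seq[-1])
    return seq

def entropy_rate(p=P_SWITCH):
    """True per-character entropy rate of the source, in bits."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def unigram_cross_entropy(tokens, total_chars):
    """In-sample cross-entropy (bits per character) of a unigram model
    fit to the empirical token frequencies."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    total = len(tokens)
    bits = -sum(c * math.log2(c / total) for c in counts.values())
    return bits / total_chars

def lzw_dictionary(train_seq, max_size):
    """LZW-style dictionary: extend the current phrase while it stays in the
    dictionary, then add the extension as a new token and start over."""
    dictionary, phrase = {"0", "1"}, ""
    for ch in map(str, train_seq):
        if phrase + ch in dictionary:
            phrase += ch
        else:
            dictionary.add(phrase + ch)
            phrase = ch
        if len(dictionary) >= max_size:
            break
    return dictionary

def greedy_encode(seq, dictionary):
    """Greedy longest-match segmentation into dictionary tokens (single
    characters are always in the dictionary, so this always terminates)."""
    text = "".join(map(str, seq))
    tokens, i, max_len = [], 0, max(len(t) for t in dictionary)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens

seq = sample_markov(200_000)
k = 8  # block length for the toy "all length-k sequences" tokenizer

# 1) No tokenization: every character is its own token, so the unigram model
#    only sees the (roughly uniform) marginal and pays about 1 bit/char.
ce_char = unigram_cross_entropy([str(c) for c in seq], len(seq))

# 2) Toy tokenizer: segment the text into consecutive length-k blocks drawn
#    from the dictionary of all 2**k length-k sequences.
blocks = ["".join(map(str, seq[i:i + k])) for i in range(0, len(seq) - k + 1, k)]
ce_block = unigram_cross_entropy(blocks, len(blocks) * k)

# 3) LZW-style tokenizer with the same dictionary budget (2**k tokens),
#    learned on a fresh sample and applied to the evaluation sequence.
lzw_dict = lzw_dictionary(sample_markov(50_000), max_size=2**k)
ce_lzw = unigram_cross_entropy(greedy_encode(seq, lzw_dict), len(seq))

print(f"entropy rate of source       : {entropy_rate():.3f} bits/char")
print(f"character-level unigram      : {ce_char:.3f} bits/char")
print(f"all-length-{k}-blocks unigram  : {ce_block:.3f} bits/char")
print(f"LZW-style unigram            : {ce_lzw:.3f} bits/char")
```

For these settings the character-level unigram pays roughly the marginal entropy, about 1 bit per character, whereas the block-level unigram pays roughly (1 + (k - 1) · 0.72)/k ≈ 0.76 bits per character, approaching the 0.72-bit entropy rate as k grows. The LZW-style run is included only to mirror the paper's point that practical tokenizers can reach a given cross-entropy with much smaller dictionaries; its exact number here depends entirely on the toy parameters.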