April 15, 2024 | Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran
This paper investigates the theoretical aspects of tokenization in language models (LMs) by studying the behavior of transformers on simple data-generating processes. The authors find that when transformers are trained without tokenization on data drawn from certain $k^{\text{th}}$-order Markov processes, they empirically fail to learn the correct distribution and instead predict characters according to a unigram model, incurring high cross-entropy loss. With tokenization, however, transformers break through this barrier and model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. The study also shows that even the simplest unigram models learned by transformers can model the probability of sequences from $k^{\text{th}}$-order Markov sources near-optimally when paired with an appropriate tokenizer. This analysis provides a justification for the use of tokenization in practice by demonstrating its effectiveness in capturing Markovian data. The paper further analyzes the behavior of specific tokenizers, including the LZW and BPE tokenizers, and proves that they can achieve near-optimal cross-entropy loss with dictionaries much smaller than those required by unigram models. The results highlight the importance of tokenization in improving the performance of transformers on Markovian data.
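The following is a minimal sketch (not the paper's code) of the phenomenon described above: on data from a $k^{\text{th}}$-order Markov source, a character-level unigram model is stuck near the marginal entropy, while a unigram model fit over tokens produced by an LZW-style tokenizer gets much closer to the source's entropy rate. The binary "switching" source, the choice of $k$ and switching probability, and the greedy LZW parse are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged illustration: unigram cross-entropy with vs. without tokenization
# on a simple k-th order Markov source. All parameters are illustrative.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def sample_markov(n, k=2, p_stay=0.9):
    # Binary order-k source: the next bit copies the bit from k steps ago
    # with probability p_stay, and flips it otherwise.
    x = list(rng.integers(0, 2, size=k))
    for _ in range(n - k):
        prev = x[-k]
        x.append(prev if rng.random() < p_stay else 1 - prev)
    return "".join(map(str, x))

def lzw_tokenize(text):
    # Greedy LZW parsing: emit the longest dictionary match, then add the
    # match extended by one character to the dictionary.
    dictionary = {"0", "1"}
    tokens, cur = [], ""
    for ch in text:
        if cur + ch in dictionary:
            cur += ch
        else:
            tokens.append(cur)
            dictionary.add(cur + ch)
            cur = ch
    tokens.append(cur)
    return tokens

def unigram_bits_per_char(units, total_chars):
    # Empirical cross-entropy (bits per source character) of a unigram model
    # fit on the same sequence of units (characters or tokens).
    counts = Counter(units)
    total = sum(counts.values())
    nll = -sum(c * np.log2(c / total) for c in counts.values())
    return nll / total_chars

text = sample_markov(200_000)
char_bpc = unigram_bits_per_char(text, len(text))                  # no tokenization
token_bpc = unigram_bits_per_char(lzw_tokenize(text), len(text))   # LZW tokens
p = 0.9
entropy_rate = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))        # bits per character
print(f"char-level unigram : {char_bpc:.3f} bits/char")
print(f"LZW-token unigram  : {token_bpc:.3f} bits/char")
print(f"source entropy rate: {entropy_rate:.3f} bits/char")
```

With these illustrative settings, the character-level unigram model pays about 1 bit per character (the marginals are uniform), whereas the unigram model over LZW tokens lands much closer to the source entropy rate of roughly 0.47 bits per character, mirroring the paper's qualitative claim that tokenization lets even unigram predictors approach the optimal cross-entropy.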