28 Feb 2024 | Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner
Tokenization is a critical step in Natural Language Processing (NLP) that converts human-readable text into a sequence of tokens for use by statistical models. While existing tokenization methods such as Byte-Pair Encoding (BPE) are widely used, the assumption that fewer tokens always lead to better performance has been challenged. This paper introduces PATHPIECE, a new tokenizer that minimizes the number of tokens needed to represent a document. Through extensive experiments, the authors find that the hypothesis that fewer tokens improve downstream performance does not hold. Instead, they examine the impact of design decisions across the three stages of tokenization: pre-tokenization, vocabulary construction, and segmentation, and find that pre-tokenization and the use of BPE for vocabulary construction are particularly important.

The study evaluates 64 language models trained with varying tokenization methods, ranging from 350M to 2.4B parameters. The results show that no single tokenizer outperforms the others in all cases and that vocabulary size has little impact on downstream performance. The findings also highlight the importance of pre-tokenization methods and the relative effectiveness of different segmentation strategies. Overall, they suggest that the effectiveness of tokenization is influenced by multiple factors and that the compression hypothesis is not the sole determinant of performance. The authors provide open-source access to their tokenization methods and models, encouraging further research in this area.
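To make the core idea concrete, below is a minimal sketch of how a segmentation that minimizes the number of tokens can be computed with dynamic programming (shortest path over token boundaries). It is an illustration in the spirit of PATHPIECE, not the authors' released implementation: it works on characters rather than bytes, and the `vocab` and `max_len` values in the example are hypothetical.

```python
# Minimal-token segmentation sketch: given a vocabulary and a maximum token
# length, find a segmentation of the input using the fewest tokens.
# Illustrative only; not the authors' released PATHPIECE implementation.

def min_token_segmentation(text: str, vocab: set[str], max_len: int) -> list[str]:
    n = len(text)
    INF = float("inf")
    # best[i] = fewest tokens needed to cover text[:i]; back[i] = start of the last token
    best = [INF] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == INF:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Recover the segmentation by walking the backpointers from the end.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Example with a hypothetical vocabulary:
vocab = {"t", "o", "k", "e", "n", "to", "ken", "token", "iz", "ation", "ization"}
print(min_token_segmentation("tokenization", vocab, max_len=7))
# -> ['token', 'ization']  (2 tokens, the minimum for this vocabulary)
```

The dynamic program runs in O(n * max_len) time, since each position only looks back at most `max_len` characters for a valid token; this is what makes a globally token-minimal segmentation tractable, in contrast to the greedy merge order used by BPE.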