28 Feb 2024 | Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner
Tokenization is a critical step in Natural Language Processing (NLP) that converts human-readable text into a sequence of tokens for use by statistical models. While existing tokenization methods such as Byte-Pair Encoding (BPE) are widely used, the assumption that fewer tokens always lead to better performance has been challenged. This paper introduces PATHPIECE, a new tokenizer that minimizes the number of tokens needed to represent a document. Through extensive experiments, the authors find that the hypothesis that fewer tokens improve downstream performance does not hold. Instead, they examine the impact of design decisions across the three stages of tokenization: pre-tokenization, vocabulary construction, and segmentation, and find that pre-tokenization and the use of BPE for vocabulary construction are particularly important.

The study evaluates 64 language models trained with varying tokenization methods, ranging from 350M to 2.4B parameters. The results show that no single tokenizer outperforms the others in all cases and that vocabulary size has little impact on downstream performance. The findings also highlight the importance of pre-tokenization methods and the relative effectiveness of different segmentation strategies. Overall, they suggest that the effectiveness of tokenization is influenced by multiple factors and that the compression hypothesis is not the sole determinant of performance. The authors provide open-source access to their tokenization methods and models, encouraging further research in this area.
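To make the core idea concrete, below is a minimal sketch of how a segmentation that minimizes the number of tokens can be computed with dynamic programming (shortest path over token boundaries). It is an illustration in the spirit of PATHPIECE, not the authors' released implementation: it works on characters rather than bytes, and the `vocab` and `max_len` values in the example are hypothetical.

```python
# Minimal-token segmentation sketch: given a vocabulary and a maximum token
# length, find a segmentation of the input using the fewest tokens.
# Illustrative only; not the authors' released PATHPIECE implementation.

def min_token_segmentation(text: str, vocab: set[str], max_len: int) -> list[str]:
    n = len(text)
    INF = float("inf")
    # best[i] = fewest tokens needed to cover text[:i]; back[i] = start of the last token
    best = [INF] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == INF:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Recover the segmentation by walking the backpointers from the end.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Example with a hypothetical vocabulary:
vocab = {"t", "o", "k", "e", "n", "to", "ken", "token", "iz", "ation", "ization"}
print(min_token_segmentation("tokenization", vocab, max_len=7))
# -> ['token', 'ization']  (2 tokens, the minimum for this vocabulary)
```

The dynamic program runs in O(n * max_len) time, since each position only looks back at most `max_len` characters for a valid token; this is what makes a globally token-minimal segmentation tractable, in contrast to the greedy merge order used by BPE.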