Greed is All You Need: An Evaluation of Tokenizer Inference Methods

2024 | Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter
This paper evaluates seven tokenizer inference methods across four different algorithms and three vocabulary sizes, using a novel intrinsic evaluation suite for English that combines measures from morphology, cognition, and information theory. Three findings stand out: greedy inference performs surprisingly well for commonly used tokenizers; SaGe, a recently introduced contextually informed tokenizer, outperforms the others on morphological alignment; and inference methods that minimize token count perform best on cognitive metrics.

The four tokenizer vocabularies evaluated are BPE, UnigramLM, WordPiece, and SaGe. All vocabularies are constructed from the train split of the MiniPile dataset using the HuggingFace Tokenizers library, at sizes of 32,768, 40,960, and 49,152.

On inference methods, greedy strategies work well across all four vocabularies on both morphological and information-theoretic metrics (a minimal sketch of greedy longest-prefix segmentation follows this summary), while inference methods based on merge rules differ significantly in morphological alignment. Under likelihood-based inference, frequently used tokens receive very high likelihood values, sometimes exceeding those of the gold-standard segments; likelihood-based inference also performs poorly on Rényi efficiency, contrary to its stated purpose, whereas dropout performs well on that measure, in line with its goal (a sketch of the efficiency computation also appears below). The least-tokens strategy scores well on the token-count metric and on the cognitive measures, suggesting a human preference for minimal word segmentation (see the dynamic-programming sketch below). Longest suffix performs poorly across the board, possibly because English is a suffixing language.

Across vocabularies, BPE trails UnigramLM on morphological alignment, but some of that gap is attributable to the inference method rather than to the vocabulary itself. SaGe is the most morphologically aligned by a substantial margin, indicating that its contextualized objective succeeds in retaining meaningful tokens in the vocabulary during ablation. The two likelihood-based vocabularies follow exactly the same within-vocabulary trends, and the trends for the two information-based vocabularies are also very close, highlighting the consistency and robustness of the benchmark.

The paper concludes that greedy inference is a good choice, especially for morphologically motivated tasks, even for tokenizers trained on other objectives, and that the benchmark can serve as a fruitful first step in LM training efforts, whether to improve tokenization schemes or to select inference methods on-line. The evaluation is limited to English, a language with relatively low morphological complexity; future work should evaluate tokenization methods across a diverse array of languages.
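To make the greedy strategy concrete, here is a minimal sketch of longest-prefix segmentation, the WordPiece-style greedy variant: at each step it emits the longest vocabulary item that prefixes the remaining text. The toy vocabulary, the character-level fallback, and the omission of continuation markers (e.g., "##") are simplifications for illustration, not the paper's implementation.

```python
def greedy_segment(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-prefix inference: repeatedly emit the longest
    vocabulary item that matches the start of the remaining text."""
    tokens = []
    i = 0
    while i < len(word):
        # Scan candidate prefixes from longest to shortest.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No prefix matched: fall back to a single character
            # (real tokenizers use byte fallback or an <unk> token).
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"un", "believ", "unbeliev", "able", "ing"}
print(greedy_segment("unbelievable", vocab))  # ['unbeliev', 'able']
```

Note that greedy inference commits locally and never revisits a choice; with an unlucky vocabulary it can miss a globally better split, which makes its strong showing on the morphological and information-theoretic metrics all the more notable.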
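The least-tokens strategy rewarded by the cognitive metrics can be realized as a shortest-path dynamic program over split points. The following is a sketch of that general technique under the same toy assumptions as above, not the paper's code.

```python
def least_tokens_segment(word: str, vocab: set[str]) -> list[str] | None:
    """Minimize token count: best[i] is the fewest tokens covering
    word[:i]; back[i] records the split point achieving it."""
    n = len(word)
    best = [0] + [float("inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] + 1 < best[i] and word[j:i] in vocab:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == float("inf"):
        return None  # the vocabulary cannot cover this word
    # Walk the back-pointers from the end to recover the segmentation.
    tokens = []
    i = n
    while i > 0:
        tokens.append(word[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"un", "believ", "unbeliev", "able", "a", "ble"}
print(least_tokens_segment("unbelievable", vocab))  # ['unbeliev', 'able']
```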
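Finally, on the information-theoretic side, Rényi efficiency normalizes the Rényi entropy of the empirical token distribution by the maximum attainable entropy, log |V|; higher values mean the tokenizer spreads probability mass more evenly over the vocabulary. The sketch below follows the definition from Zouhar et al. (2023); the choice of α = 2.5 and the use of raw unigram counts are assumptions for illustration, not necessarily the paper's exact configuration.

```python
import math
from collections import Counter

def renyi_efficiency(tokens: list[str], vocab_size: int, alpha: float = 2.5) -> float:
    """Rényi entropy of the empirical token distribution, normalized
    by log(vocab_size), the entropy of a uniform distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha), for alpha != 1.
    h_alpha = math.log(sum((c / total) ** alpha for c in counts.values())) / (1 - alpha)
    return h_alpha / math.log(vocab_size)

# A uniform token stream attains the maximum efficiency of 1.0;
# a skewed stream scores lower.
print(renyi_efficiency(["a", "b", "c", "d"], vocab_size=4))  # ≈ 1.0
print(renyi_efficiency(["a", "a", "a", "b"], vocab_size=4))  # ≈ 0.32
```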