Getting the most out of your tokenizer for pre-training and domain adaptation

7 Feb 2024 | Gautier Dagan¹, Gabriel Synnaeve², Baptiste Rozière²
This paper investigates how tokenizer design affects the performance of large language models (LLMs), with a focus on code generation. The authors show that a tokenizer's vocabulary size, pre-tokenization regular expression, and training data significantly affect a model's generation speed, effective context size, memory usage, and downstream performance. They train specialized Byte-Pair Encoding (BPE) tokenizers and run extensive ablations on code generation benchmarks such as HumanEval and MBPP, yielding recommendations for tokenizer hyper-parameter selection and for switching tokenizers in pre-trained LLMs. Experiments cover both models trained from scratch and fine-tuned pre-trained models, so the findings apply to a wide range of use cases.

The results show that when fine-tuning on more than 50 billion tokens, the tokenizer of a pre-trained LLM can be replaced with a specialized one to gain significant improvements in generation speed and effective context size. The paper identifies three main factors that determine tokenizer compression: the data used to train the tokenizer, the pre-tokenization scheme, and the vocabulary size; it also proposes methods to compute inference-optimal and memory-optimal vocabulary sizes. The authors analyze how fine-tuning and training from scratch with different tokenizers affect downstream code generation performance, and find that changing the tokenizer when fine-tuning on 50B tokens or more has little impact on downstream performance while substantially improving compression and inference speed. Examining vocabulary size, they find that larger vocabularies do not necessarily hurt performance, and that an optimal vocabulary size balances compression and performance. The study also compares methods for updating the tokenizer of a pre-trained model, including Fast Vocabulary Transfer (FVT) and extending an existing tokenizer, and finds that FVT yields noticeable improvements on downstream tasks. The paper concludes that tokenization is a critical component of LLMs, that changing the tokenizer during fine-tuning can significantly improve compression and inference speed without sacrificing performance, and that the GPT-4 pre-tokenization regular expression offers a good balance between compression and performance. Overall, the results support the thesis that, given long enough fine-tuning, tokenizers can be changed without sacrificing performance.
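To make the pre-tokenization and compression discussion concrete, below is a minimal sketch of training a byte-level BPE tokenizer with a GPT-4-style split regex using the Hugging Face `tokenizers` library, followed by a rough bytes-per-token compression check. This is an illustration, not the paper's code: the regex only approximates the cl100k_base pattern, and the vocabulary size, special tokens, and corpus path are placeholder assumptions.

```python
# Sketch: byte-level BPE tokenizer with a GPT-4-style pre-tokenization regex.
# Vocab size, special tokens, and the corpus file are illustrative placeholders.
from tokenizers import Tokenizer, Regex, models, pre_tokenizers, decoders, trainers

# Approximation of the GPT-4 (cl100k_base) split pattern.
GPT4_SPLIT_RE = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
    r"|[^\r\n\p{L}\p{N}]?\p{L}+"
    r"|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*"
    r"|\s*[\r\n]+"
    r"|\s+(?!\S)"
    r"|\s+"
)

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    # Split on the regex first, then map each piece to printable bytes so the
    # vocabulary can cover arbitrary UTF-8 input.
    pre_tokenizers.Split(Regex(GPT4_SPLIT_RE), behavior="isolated"),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                      # hypothetical size; the paper ablates this
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["my_code_corpus.txt"], trainer=trainer)  # placeholder corpus

# Rough compression check: average input bytes per produced token.
sample = open("my_code_corpus.txt", encoding="utf-8").read()
n_tokens = len(tokenizer.encode(sample).ids)
print(f"bytes per token: {len(sample.encode('utf-8')) / n_tokens:.2f}")
```

A higher bytes-per-token value means better compression, which is what drives the gains in generation speed and effective context size reported above; the trade-off against downstream accuracy is what the paper's ablations over training data, regex, and vocabulary size measure.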
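The Fast Vocabulary Transfer comparison can also be illustrated with a short sketch: each new token's embedding is initialized as the mean of the old embeddings of the pieces the original tokenizer splits it into, so the model starts fine-tuning from an informed guess rather than random vectors. The helper below is a hedged illustration of that idea under assumed Hugging Face-style tokenizer interfaces (`old_tok`, `new_tok`, `old_emb` are placeholders), not the paper's exact implementation; the output projection (LM head) rows would be re-initialized the same way.

```python
# Sketch of FVT-style embedding initialization: a new token's embedding is the
# mean of the old embeddings of the sub-pieces the old tokenizer produces for it.
import torch

def fvt_init(old_tok, new_tok, old_emb: torch.Tensor) -> torch.Tensor:
    """Build an embedding matrix of shape (len(new_tok), dim) from old_emb."""
    dim = old_emb.shape[1]
    new_emb = torch.empty(len(new_tok.get_vocab()), dim)
    for token, new_id in new_tok.get_vocab().items():
        piece = new_tok.convert_tokens_to_string([token])     # recover surface string
        old_ids = old_tok.encode(piece, add_special_tokens=False)
        if old_ids:
            new_emb[new_id] = old_emb[old_ids].mean(dim=0)    # average of old pieces
        else:
            new_emb[new_id] = old_emb.mean(dim=0)             # fallback for empty splits
    return new_emb
```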