Getting the most out of your tokenizer for pre-training and domain adaptation

7 Feb 2024 | Gautier Dagan¹, Gabriel Synnaeve², Baptiste Rozière²
This paper investigates how tokenizer design affects the performance of large language models (LLMs), with a focus on code generation. The authors show that a tokenizer's vocabulary size, pre-tokenization regular expression, and training data significantly affect a model's generation speed, effective context size, memory usage, and downstream performance. They train specialized Byte-Pair Encoding (BPE) tokenizers and run extensive ablations on code generation benchmarks such as HumanEval and MBPP, yielding recommendations for tokenizer hyper-parameter selection and for switching tokenizers in pre-trained LLMs. Experiments cover both models trained from scratch and fine-tuned pre-trained models, so the findings apply to a wide range of use cases.

The results show that when fine-tuning on more than 50 billion tokens, the tokenizer of a pre-trained LLM can be replaced with a specialized one to gain significant improvements in generation speed and effective context size. The paper identifies three main factors that determine tokenizer compression: the data used to train the tokenizer, the pre-tokenization scheme, and the vocabulary size; it also proposes methods to compute inference-optimal and memory-optimal vocabulary sizes. The authors analyze how fine-tuning and training from scratch with different tokenizers affect downstream code generation performance, and find that changing the tokenizer when fine-tuning on 50B tokens or more has little impact on downstream performance while substantially improving compression and inference speed. Examining vocabulary size, they find that larger vocabularies do not necessarily hurt performance, and that an optimal vocabulary size balances compression and performance. The study also compares methods for updating the tokenizer of a pre-trained model, including Fast Vocabulary Transfer (FVT) and extending an existing tokenizer, and finds that FVT yields noticeable improvements on downstream tasks. The paper concludes that tokenization is a critical component of LLMs, that changing the tokenizer during fine-tuning can significantly improve compression and inference speed without sacrificing performance, and that the GPT-4 pre-tokenization regular expression offers a good balance between compression and performance. Overall, the results support the thesis that, given long enough fine-tuning, tokenizers can be changed without sacrificing performance.
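To make the pre-tokenization and compression discussion concrete, below is a minimal sketch of training a byte-level BPE tokenizer with a GPT-4-style split regex using the Hugging Face `tokenizers` library, followed by a rough bytes-per-token compression check. This is an illustration, not the paper's code: the regex only approximates the cl100k_base pattern, and the vocabulary size, special tokens, and corpus path are placeholder assumptions.

```python
# Sketch: byte-level BPE tokenizer with a GPT-4-style pre-tokenization regex.
# Vocab size, special tokens, and the corpus file are illustrative placeholders.
from tokenizers import Tokenizer, Regex, models, pre_tokenizers, decoders, trainers

# Approximation of the GPT-4 (cl100k_base) split pattern.
GPT4_SPLIT_RE = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
    r"|[^\r\n\p{L}\p{N}]?\p{L}+"
    r"|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*"
    r"|\s*[\r\n]+"
    r"|\s+(?!\S)"
    r"|\s+"
)

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    # Split on the regex first, then map each piece to printable bytes so the
    # vocabulary can cover arbitrary UTF-8 input.
    pre_tokenizers.Split(Regex(GPT4_SPLIT_RE), behavior="isolated"),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                      # hypothetical size; the paper ablates this
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["my_code_corpus.txt"], trainer=trainer)  # placeholder corpus

# Rough compression check: average input bytes per produced token.
sample = open("my_code_corpus.txt", encoding="utf-8").read()
n_tokens = len(tokenizer.encode(sample).ids)
print(f"bytes per token: {len(sample.encode('utf-8')) / n_tokens:.2f}")
```

A higher bytes-per-token value means better compression, which is what drives the gains in generation speed and effective context size reported above; the trade-off against downstream accuracy is what the paper's ablations over training data, regex, and vocabulary size measure.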
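The Fast Vocabulary Transfer comparison can also be illustrated with a short sketch: each new token's embedding is initialized as the mean of the old embeddings of the pieces the original tokenizer splits it into, so the model starts fine-tuning from an informed guess rather than random vectors. The helper below is a hedged illustration of that idea under assumed Hugging Face-style tokenizer interfaces (`old_tok`, `new_tok`, `old_emb` are placeholders), not the paper's exact implementation; the output projection (LM head) rows would be re-initialized the same way.

```python
# Sketch of FVT-style embedding initialization: a new token's embedding is the
# mean of the old embeddings of the sub-pieces the old tokenizer produces for it.
import torch

def fvt_init(old_tok, new_tok, old_emb: torch.Tensor) -> torch.Tensor:
    """Build an embedding matrix of shape (len(new_tok), dim) from old_emb."""
    dim = old_emb.shape[1]
    new_emb = torch.empty(len(new_tok.get_vocab()), dim)
    for token, new_id in new_tok.get_vocab().items():
        piece = new_tok.convert_tokens_to_string([token])     # recover surface string
        old_ids = old_tok.encode(piece, add_special_tokens=False)
        if old_ids:
            new_emb[new_id] = old_emb[old_ids].mean(dim=0)    # average of old pieces
        else:
            new_emb[new_id] = old_emb.mean(dim=0)             # fallback for empty splits
    return new_emb
```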