1 Nov 2024 | Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
This paper investigates the impact of vocabulary size on the scaling laws of large language models (LLMs). The authors propose three complementary approaches (IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function) to predict the optimal vocabulary size for a given computational budget. They find that larger models require larger vocabularies, yet most LLMs use vocabularies that are too small. For example, the optimal vocabulary size for Llama2-70B should have been at least 216K, roughly 7 times larger than its actual 32K. Empirical validation shows that adopting the predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. The study highlights the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available online.
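To make the idea concrete, here is a minimal sketch of how one might extrapolate an optimal vocabulary size from compute budget in the spirit of the paper's IsoFLOPs-style analysis, assuming a simple power-law relationship V_opt ≈ a · C^b. The data points, fitted coefficients, and the helper `predict_optimal_vocab` are all hypothetical, not taken from the paper.

```python
# Minimal sketch: fit a power law V_opt ~ a * C^b from hypothetical
# (compute budget, best vocabulary size) pairs. All numbers below are
# illustrative placeholders, not results from the paper.
import numpy as np

# Hypothetical IsoFLOPs observations: compute budget C (FLOPs) and the
# vocabulary size that minimized validation loss at that budget.
compute_flops = np.array([1e19, 1e20, 1e21, 1e22])
best_vocab = np.array([24_000, 40_000, 68_000, 115_000])

# Fit log V = b * log C + log a (ordinary least squares in log-log space).
# np.polyfit with deg=1 returns [slope, intercept].
b, log_a = np.polyfit(np.log(compute_flops), np.log(best_vocab), deg=1)
a = np.exp(log_a)

def predict_optimal_vocab(c_flops: float) -> int:
    """Predicted vocabulary size for a given compute budget (power-law extrapolation)."""
    return int(a * c_flops ** b)

# Example: extrapolate to a much larger training budget.
print(f"Fitted law: V_opt ~ {a:.3g} * C^{b:.3f}")
print("Predicted vocab at 1e24 FLOPs:", predict_optimal_vocab(1e24))
```

The design choice of fitting in log-log space mirrors how scaling-law coefficients are usually estimated; with real IsoFLOPs measurements in place of the placeholder arrays, the same fit would yield the kind of prediction the paper reports for models such as Llama2-70B.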