Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

1 Nov 2024 | Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
This paper investigates how vocabulary size affects the scaling laws of large language models (LLMs). Prior work on scaling laws has focused on model parameters and training data size while overlooking the role of vocabulary size. The authors propose three complementary approaches to predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and a parametric fit of the loss function. Their analysis shows that the optimal vocabulary size depends on the compute budget, with larger models deserving larger vocabularies. They validate their predictions by training models with 3B parameters across different FLOPs budgets and show that adopting the predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. For example, increasing the vocabulary size from the conventional 32K to 43K improves ARC-Challenge accuracy from 29.1 to 32.0 at the same 2.3e21 FLOPs. The authors highlight the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.
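To make the IsoFLOPs idea concrete, here is a minimal, hypothetical sketch: fix a FLOPs budget, note that a larger vocabulary adds embedding parameters and therefore leaves fewer training tokens under the same budget, and pick the vocabulary size that minimizes a loss model. The loss function, constants, and candidate sizes below are illustrative assumptions, not the paper's fitted formulas.

```python
# Toy IsoFLOPs-style selection of a vocabulary size (illustrative only).
# All constants and the loss model are made-up assumptions, not the
# paper's fitted scaling law.

D_MODEL = 2048          # embedding width (assumed)
N_NONVOCAB = 1e9        # non-vocabulary parameters (assumed)

def tokens_for_budget(flop_budget: float, vocab: int) -> float:
    """Training tokens affordable under a fixed budget, using the common
    C ~ 6 * N * D approximation; input + output embeddings add 2*V*d params."""
    total_params = N_NONVOCAB + 2 * vocab * D_MODEL
    return flop_budget / (6 * total_params)

def toy_loss(vocab: int, tokens: float) -> float:
    """Hypothetical loss: more training tokens and a larger vocabulary both
    help, with diminishing returns. Exponents/constants are illustrative."""
    return 1.5 + 1e3 / tokens ** 0.3 + 4.0 / vocab ** 0.25

def best_vocab(flop_budget: float, candidates: list[int]) -> int:
    """IsoFLOPs sweep: at a fixed budget, evaluate each candidate vocabulary
    (trading vocabulary parameters against training tokens) and keep the
    one with the lowest predicted loss."""
    return min(candidates,
               key=lambda v: toy_loss(v, tokens_for_budget(flop_budget, v)))

candidates = [16_384, 32_768, 49_152, 65_536, 98_304]
for budget in (1e20, 1e21, 1e22):
    print(f"C={budget:.0e}: optimal vocab = {best_vocab(budget, candidates)}")
```

With any loss model of this shape, the sweep reproduces the paper's qualitative finding: as the compute budget grows, the token-count term matters less and the vocabulary term more, so the compute-optimal vocabulary size increases.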