Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Authors

Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

Published on

December 10, 2024

Publisher

Neural Information Processing Systems (NeurIPS), 2024

We investigate how vocabulary size impacts language model scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We find that the optimal vocabulary size depends on the compute budget, and we propose three approaches to determine it. All approaches suggest that vocabulary parameters should be scaled more slowly than non-vocabulary parameters. Nonetheless, vocabulary parameters are critical for performance and under-allocated in current LLMs. By adopting the vocabulary size predicted by our method instead of the conventional setting, we train better 3B parameter models in settings where the amount of training data is 1) insufficient; 2) optimally allocated given the compute budget; 3) overly sufficient. Our work reveals the underestimated role of vocabulary and the necessity of jointly considering vocabulary size, model parameters, and training data for efficient scaling.
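
To make the parameter split discussed in the abstract concrete, the sketch below counts vocabulary parameters (input embedding plus output unembedding) and non-vocabulary (transformer-body) parameters for a decoder-only model, and shows how a sublinear power-law rule for the vocabulary size would behave as the body grows. The counting formulas, the coefficient, and the exponent here are illustrative assumptions for this sketch, not the fitted relations reported in the paper.

```python
# Illustrative sketch (not the paper's fitted law): how vocabulary and
# non-vocabulary parameters split in a decoder-only transformer, and how a
# sublinear power-law rule for the vocabulary size behaves as models grow.

def vocab_params(vocab_size: int, d_model: int, tied_embeddings: bool = False) -> int:
    """Parameters in the input embedding and output unembedding matrices."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied_embeddings else 2 * per_matrix

def non_vocab_params(n_layers: int, d_model: int) -> int:
    """Rough count of transformer-body parameters (attention + MLP),
    using the common 12 * n_layers * d_model^2 approximation."""
    return 12 * n_layers * d_model ** 2

def suggested_vocab(n_non_vocab: int, coeff: float = 1.0, exponent: float = 0.5) -> int:
    """Hypothetical sublinear rule V ~ coeff * N_nv ** exponent (exponent < 1):
    the vocabulary grows with model size, but slower than the body does.
    Both constants are placeholders, not values fitted in the paper."""
    return int(coeff * n_non_vocab ** exponent)

if __name__ == "__main__":
    for n_layers, d_model in [(12, 768), (24, 2048), (32, 4096)]:
        n_nv = non_vocab_params(n_layers, d_model)
        v = suggested_vocab(n_nv)
        n_v = vocab_params(v, d_model)
        share = n_v / (n_v + n_nv)
        print(f"body={n_nv / 1e6:9.1f}M  vocab_size={v:7d}  "
              f"vocab_params={n_v / 1e6:7.1f}M  vocab_share={share:.1%}")
```

Running the sketch shows the qualitative behavior the abstract describes: the suggested vocabulary size keeps increasing with model size, while the share of total parameters devoted to the vocabulary shrinks, i.e. vocabulary parameters are scaled more slowly than non-vocabulary parameters.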