Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Authors

Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

Published on

December 10, 2024

Publisher

Neural Information Processing Systems (NeurIPS), 2024

We investigate how vocabulary size impacts language model scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We find that the optimal vocabulary size depends on the compute budget, and we propose three approaches to determine it. All approaches suggest that vocabulary parameters should be scaled more slowly than non-vocabulary parameters. Nonetheless, vocabulary parameters are critical for performance and under-allocated in current LLMs. By adopting the vocabulary size predicted by our method instead of the conventional setting, we train better 3B parameter models in settings where the amount of training data is 1) insufficient; 2) optimally allocated given the compute budget; 3) overly sufficient. Our work reveals the underestimated role of vocabulary and the necessity of jointly considering vocabulary size, model parameters, and training data for efficient scaling.
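
To make the parameter split discussed in the abstract concrete, the sketch below counts vocabulary parameters (input embedding plus output unembedding) and non-vocabulary (transformer-body) parameters for a decoder-only model, and shows how a sublinear power-law rule for the vocabulary size would behave as the body grows. The counting formulas, the coefficient, and the exponent here are illustrative assumptions for this sketch, not the fitted relations reported in the paper.

```python
# Illustrative sketch (not the paper's fitted law): how vocabulary and
# non-vocabulary parameters split in a decoder-only transformer, and how a
# sublinear power-law rule for the vocabulary size behaves as models grow.

def vocab_params(vocab_size: int, d_model: int, tied_embeddings: bool = False) -> int:
    """Parameters in the input embedding and output unembedding matrices."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied_embeddings else 2 * per_matrix

def non_vocab_params(n_layers: int, d_model: int) -> int:
    """Rough count of transformer-body parameters (attention + MLP),
    using the common 12 * n_layers * d_model^2 approximation."""
    return 12 * n_layers * d_model ** 2

def suggested_vocab(n_non_vocab: int, coeff: float = 1.0, exponent: float = 0.5) -> int:
    """Hypothetical sublinear rule V ~ coeff * N_nv ** exponent (exponent < 1):
    the vocabulary grows with model size, but slower than the body does.
    Both constants are placeholders, not values fitted in the paper."""
    return int(coeff * n_non_vocab ** exponent)

if __name__ == "__main__":
    for n_layers, d_model in [(12, 768), (24, 2048), (32, 4096)]:
        n_nv = non_vocab_params(n_layers, d_model)
        v = suggested_vocab(n_nv)
        n_v = vocab_params(v, d_model)
        share = n_v / (n_v + n_nv)
        print(f"body={n_nv / 1e6:9.1f}M  vocab_size={v:7d}  "
              f"vocab_params={n_v / 1e6:7.1f}M  vocab_share={share:.1%}")
```

Running the sketch shows the qualitative behavior the abstract describes: the suggested vocabulary size keeps increasing with model size, while the share of total parameters devoted to the vocabulary shrinks, i.e. vocabulary parameters are scaled more slowly than non-vocabulary parameters.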