PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Authors

Xinyi Wan, Penghui Qi, Guangxing Huang, Jialin Li, Min Lin

Published on

March 03, 2025

Publisher

Preprint

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we address this challenge by leveraging the under-explored memory offload strategy in PP. Through an empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments show that the per-device activation memory effectively decreases as the total number of stages grows, making PP a stronger alternative to tensor parallelism (TP), offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble
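
To illustrate why in-flight microbatches limit scalability, here is a minimal sketch that assumes a standard 1F1B schedule with layers split evenly across stages. The function names and the simple memory model are illustrative assumptions, not the paper's implementation or cost model.

```python
def peak_in_flight(stage: int, num_stages: int, num_microbatches: int) -> int:
    """Peak number of microbatches whose activations a stage holds at once.

    Assumption: 1F1B schedule, where stage `stage` (0-indexed) runs
    num_stages - stage forward passes before its first backward pass.
    """
    return min(num_stages - stage, num_microbatches)


def peak_activation_memory(
    stage: int,
    num_stages: int,
    num_microbatches: int,
    model_act_per_microbatch: float,
) -> float:
    """Peak activation memory on one stage, in the units of the last argument.

    Assumption: the activations of the whole model for one microbatch are
    split evenly across stages.
    """
    per_stage = model_act_per_microbatch / num_stages
    return peak_in_flight(stage, num_stages, num_microbatches) * per_stage


if __name__ == "__main__":
    for p in (4, 8, 16):
        first = peak_activation_memory(0, p, 2 * p, 8.0)
        last = peak_activation_memory(p - 1, p, 2 * p, 8.0)
        print(f"stages={p}: first stage {first}, last stage {last}")
```

Under these assumptions the first stage's peak stays constant as stages are added, because the smaller per-stage share is cancelled by the larger number of in-flight microbatches. This is the growth that offloading targets.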

Other publications

Balancing Pipeline Parallelism with Vocabulary Parallelism

Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan

2024

MLSys 2025

Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, while also significantly reducing peak memory usage, especially in large-vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism.
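
As a rough illustration of partitioning a vocabulary evenly across devices, the sketch below splits a logit vector into shards and combines per-shard maxima and sums into the softmax normalizer. It runs on a single process, the helper names are hypothetical, and it is not the algorithm proposed in the paper. It only shows the cross-shard reductions (a max and a sum) that a sharded softmax requires.

```python
import math
from typing import List, Tuple


def shard_bounds(vocab_size: int, num_devices: int) -> List[Tuple[int, int]]:
    """Half-open [start, end) vocabulary ranges, one per device."""
    shard = math.ceil(vocab_size / num_devices)
    return [
        (min(d * shard, vocab_size), min((d + 1) * shard, vocab_size))
        for d in range(num_devices)
    ]


def sharded_logsumexp(logits: List[float], num_devices: int) -> float:
    """log(sum(exp(logits))) computed from per-shard partial results."""
    partials = []
    for start, end in shard_bounds(len(logits), num_devices):
        chunk = logits[start:end]
        if not chunk:
            continue  # a device may receive an empty shard
        local_max = max(chunk)
        local_sum = sum(math.exp(x - local_max) for x in chunk)
        partials.append((local_max, local_sum))

    # Stand-ins for the two cross-device reductions: a max, then a sum.
    global_max = max(m for m, _ in partials)
    total = sum(s * math.exp(m - global_max) for m, s in partials)
    return global_max + math.log(total)


if __name__ == "__main__":
    logits = [0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -2.0]
    print(shard_bounds(len(logits), 4))
    print(sharded_logsumexp(logits, 4))
```

Each reduction in this sketch corresponds to a point where devices would have to synchronize, which is the kind of communication barrier the abstract refers to.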

JAX-XC: Exchange Correlation Functionals Library in JAX

Kunhao Zheng, Min Lin

2023

Workshop on "Machine Learning for Materials" ICLR 2023

We present JAX-XC, an open-source library that provides exchange-correlation functionals in JAX. JAX-XC is built from LIBXC, and its correctness has been verified numerically against LIBXC. Thanks to JAX, JAX-XC is end-to-end differentiable, computationally more efficient due to the vectorization provided by XLA, and portable across various accelerators. More importantly, as more research focuses on machine learning for density functional theory, we hope that JAX-XC can serve as a deep learning-friendly tool and a stepping stone for researchers working at the intersection of deep learning and density functional theory.
