MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
By: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, and more
Potential Business Impact:
Trains giant AI models faster on less hardware.
Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary because LLM sizes are growing faster than GPU memory. To address this, state-of-the-art systems propose multi-tier offloading to host memory or disk. Despite advanced asynchronous multi-tier read/write strategies, such offloading places significant I/O overheads on the critical path of training, slowing down iterations. We therefore propose MLP-Offload, a novel multi-level, multi-path offloading engine designed to optimize LLM training on resource-constrained setups by mitigating I/O bottlenecks. Several key observations drive the design of MLP-Offload: I/O overheads during the update phase dominate the iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion, mitigating I/O bottlenecks during the backward and update phases. Evaluations on models of up to 280B parameters show that MLP-Offload achieves 2.5$\times$ faster iterations compared to state-of-the-art LLM training runtimes.
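To make the idea concrete, the following is a minimal Python sketch of multi-level, multi-path offloading with per-tier concurrency control. It is not the authors' implementation: the tier names, the round-robin placement policy, and the `offload_optimizer_state` helper are illustrative assumptions, intended only to show how optimizer-state partitions could be written asynchronously across several storage paths while capping in-flight writes per tier to limit contention.

```python
# Hypothetical sketch (not the paper's code): spread optimizer-state partitions
# across multiple storage tiers/paths and issue asynchronous, concurrency-capped
# writes so offload traffic can overlap with backward/update compute.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

import numpy as np

# Illustrative tiers: host memory path, local NVMe, and a remote storage path.
# A real engine would discover and benchmark these instead of using temp dirs.
TIERS = [tempfile.mkdtemp(prefix="host_"),
         tempfile.mkdtemp(prefix="nvme_"),
         tempfile.mkdtemp(prefix="remote_")]

MAX_INFLIGHT_PER_TIER = 2  # cap concurrent writes per tier (concurrency control)
_tier_slots = {t: Semaphore(MAX_INFLIGHT_PER_TIER) for t in TIERS}
_pool = ThreadPoolExecutor(max_workers=len(TIERS) * MAX_INFLIGHT_PER_TIER)


def _write_partition(tier: str, name: str, array: np.ndarray) -> str:
    """Blocking write of one optimizer-state partition to one tier."""
    path = os.path.join(tier, name + ".npy")
    with _tier_slots[tier]:  # limit in-flight writes on this tier
        np.save(path, array)
    return path


def offload_optimizer_state(partitions: dict) -> list:
    """Issue asynchronous multi-path writes; the caller keeps the futures so
    offloading overlaps with backward/update compute and is joined later."""
    futures = []
    for i, (name, array) in enumerate(partitions.items()):
        tier = TIERS[i % len(TIERS)]  # naive multi-path placement (round robin)
        futures.append(_pool.submit(_write_partition, tier, name, array))
    return futures


if __name__ == "__main__":
    # Toy optimizer state: Adam moments for two layers.
    state = {f"layer{i}.{k}": np.random.rand(1024).astype(np.float32)
             for i in range(2) for k in ("exp_avg", "exp_avg_sq")}
    pending = offload_optimizer_state(state)
    # ... backward/update compute would proceed here, overlapped with I/O ...
    for f in pending:
        print("offloaded to", f.result())
```

In this sketch the per-tier semaphore plays the role of contention control, and splitting partitions across paths stands in for using the otherwise idle bandwidth of additional tiers; the paper's cache-efficient placement and scheduling decisions are not modeled here.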
Similar Papers
Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage
Distributed, Parallel, and Cluster Computing
Trains big AI models faster using extra storage.
SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
Machine Learning (CS)
Makes AI learn much faster on new chips.
10Cache: Heterogeneous Resource-Aware Tensor Caching and Migration for LLM Training
Distributed, Parallel, and Cluster Computing
Makes AI learn much faster and cheaper.