MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
By: Avinash Maurya, M. Mustafa Rafique, Franck Cappello, and more
Potential Business Impact:
Trains giant AI models faster on less hardware.
Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary because LLM sizes are growing faster than GPU memory. To address this, state-of-the-art systems propose multi-tier offloading to host memory or disk. Despite advanced asynchronous multi-tier read/write strategies, such offloading places significant I/O overheads on the critical path of training, slowing down iterations. We therefore propose MLP-Offload, a novel multi-level, multi-path offloading engine designed to optimize LLM training on resource-constrained setups by mitigating I/O bottlenecks. Several key observations drive the design of MLP-Offload: I/O overheads during the update phase dominate the iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Driven by these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion, mitigating I/O bottlenecks during the backward and update phases. Evaluations on models of up to 280B parameters show that MLP-Offload achieves 2.5$\times$ faster iterations compared to state-of-the-art LLM training runtimes.
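To make the idea concrete, the following is a minimal Python sketch of multi-level, multi-path offloading with per-tier concurrency control. It is not the authors' implementation: the tier names, the round-robin placement policy, and the `offload_optimizer_state` helper are illustrative assumptions, intended only to show how optimizer-state partitions could be written asynchronously across several storage paths while capping in-flight writes per tier to limit contention.

```python
# Hypothetical sketch (not the paper's code): spread optimizer-state partitions
# across multiple storage tiers/paths and issue asynchronous, concurrency-capped
# writes so offload traffic can overlap with backward/update compute.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

import numpy as np

# Illustrative tiers: host memory path, local NVMe, and a remote storage path.
# A real engine would discover and benchmark these instead of using temp dirs.
TIERS = [tempfile.mkdtemp(prefix="host_"),
         tempfile.mkdtemp(prefix="nvme_"),
         tempfile.mkdtemp(prefix="remote_")]

MAX_INFLIGHT_PER_TIER = 2  # cap concurrent writes per tier (concurrency control)
_tier_slots = {t: Semaphore(MAX_INFLIGHT_PER_TIER) for t in TIERS}
_pool = ThreadPoolExecutor(max_workers=len(TIERS) * MAX_INFLIGHT_PER_TIER)


def _write_partition(tier: str, name: str, array: np.ndarray) -> str:
    """Blocking write of one optimizer-state partition to one tier."""
    path = os.path.join(tier, name + ".npy")
    with _tier_slots[tier]:  # limit in-flight writes on this tier
        np.save(path, array)
    return path


def offload_optimizer_state(partitions: dict) -> list:
    """Issue asynchronous multi-path writes; the caller keeps the futures so
    offloading overlaps with backward/update compute and is joined later."""
    futures = []
    for i, (name, array) in enumerate(partitions.items()):
        tier = TIERS[i % len(TIERS)]  # naive multi-path placement (round robin)
        futures.append(_pool.submit(_write_partition, tier, name, array))
    return futures


if __name__ == "__main__":
    # Toy optimizer state: Adam moments for two layers.
    state = {f"layer{i}.{k}": np.random.rand(1024).astype(np.float32)
             for i in range(2) for k in ("exp_avg", "exp_avg_sq")}
    pending = offload_optimizer_state(state)
    # ... backward/update compute would proceed here, overlapped with I/O ...
    for f in pending:
        print("offloaded to", f.result())
```

In this sketch the per-tier semaphore plays the role of contention control, and splitting partitions across paths stands in for using the otherwise idle bandwidth of additional tiers; the paper's cache-efficient placement and scheduling decisions are not modeled here.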
Similar Papers
Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage
Distributed, Parallel, and Cluster Computing
Trains big AI models faster using extra storage.
SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
Machine Learning (CS)
Makes AI learn much faster on new chips.
10Cache: Heterogeneous Resource-Aware Tensor Caching and Migration for LLM Training
Distributed, Parallel, and Cluster Computing
Makes AI learn much faster and cheaper.