GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping
By: Yikang Yue, Yishu Yin, Xuehai Qian
SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling: all micro-batches of a layer are executed before proceeding to the next layer. Compared with existing systems that use horizontal scheduling (i.e., executing micro-batches one after another, each through all layers), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal throughput predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimizer step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity of 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at https://github.com/npz7yyk/GreedySnake.
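To make the scheduling contrast concrete, the following is a minimal sketch of the two execution orders described above, not code from the GreedySnake repository; the layer counts, micro-batch counts, and load/offload placeholder functions are hypothetical, and only the forward pass is shown.

```python
# Hypothetical sketch: horizontal vs. vertical scheduling of micro-batches
# under SSD offloading. Placeholder functions stand in for real weight
# prefetch/offload and compute; they only print the execution order.

NUM_LAYERS = 4
NUM_MICRO_BATCHES = 3

def load_layer_from_ssd(layer):
    print(f"  load    layer {layer} from SSD")

def offload_layer_to_ssd(layer):
    print(f"  offload layer {layer} to SSD")

def forward(layer, micro_batch):
    print(f"  forward layer {layer}, micro-batch {micro_batch}")

def horizontal_schedule():
    # Horizontal scheduling (existing systems): each micro-batch runs
    # through all layers before the next one starts, so every layer's
    # weights are fetched from SSD once per micro-batch.
    for mb in range(NUM_MICRO_BATCHES):
        for layer in range(NUM_LAYERS):
            load_layer_from_ssd(layer)
            forward(layer, mb)
            offload_layer_to_ssd(layer)

def vertical_schedule():
    # Vertical scheduling (GreedySnake-style order): fetch a layer once,
    # run all micro-batches through it, then move on, amortizing SSD
    # traffic across the whole accumulated batch.
    for layer in range(NUM_LAYERS):
        load_layer_from_ssd(layer)
        for mb in range(NUM_MICRO_BATCHES):
            forward(layer, mb)
        offload_layer_to_ssd(layer)

if __name__ == "__main__":
    print("horizontal (per-micro-batch) schedule:")
    horizontal_schedule()
    print("vertical (per-layer) schedule:")
    vertical_schedule()
```

Under these assumptions, the vertical order issues one load/offload pair per layer per iteration instead of one per layer per micro-batch, which is the source of the reduced I/O pressure the abstract attributes to vertical scheduling.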