LowDiff: Efficient Frequent Checkpointing via Low-Cost Differential for High-Performance Distributed Training Systems
By: Chenxuan Yao, Yuchong Hu, Feifan Liu, and more
Potential Business Impact:
Saves time and money when training AI models.
Distributed training of large deep-learning models is prone to failures, so checkpointing is commonly employed for recovery. State-of-the-art studies advocate frequent checkpointing to enable fast recovery from failures. However, frequent checkpointing produces numerous checkpoints, incurring substantial cost and degrading training performance. Differential checkpointing has recently been proposed to reduce this cost, but it has so far been limited to recommendation systems, and its application to general distributed training systems remains unexplored. This paper proposes LowDiff, an efficient frequent-checkpointing framework that reuses compressed gradients as differential checkpoints to reduce cost. LowDiff further incorporates a batched gradient-write optimization to persist these differentials to storage efficiently, and it dynamically tunes both the checkpoint frequency and the batching size to maximize performance. We further enhance LowDiff with a layer-wise gradient reusing and snapshotting approach and a CPU-based asynchronous persistence strategy, enabling frequent checkpointing even without gradient compression. Experiments on various workloads show that LowDiff can checkpoint as frequently as every iteration with less than 3.1% runtime overhead.
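The abstract describes three mechanisms: reusing gradients that were already compressed for communication as differential checkpoints, batching the differential writes, and persisting them asynchronously on the CPU so training is not stalled. The sketch below shows one way these pieces could fit together in a PyTorch-style training loop. It is not the authors' implementation; all names (DifferentialCheckpointer, full_interval, batch_size, the file layout) are illustrative assumptions.

```python
# Minimal sketch (not LowDiff's actual code) of reusing compressed gradients
# as differential checkpoints, with batched, asynchronous persistence.
import os
import queue
import threading

import torch


class DifferentialCheckpointer:
    """Persist a full checkpoint every `full_interval` iterations and, in
    between, reuse already-compressed gradients as differential checkpoints,
    batching `batch_size` of them per write to amortize I/O cost."""

    def __init__(self, model, ckpt_dir, full_interval=1000, batch_size=8):
        self.model = model
        self.ckpt_dir = ckpt_dir
        self.full_interval = full_interval
        self.batch_size = batch_size
        self.pending = []                    # differentials awaiting a batched write
        self.queue = queue.Queue()           # hand-off to the background writer
        os.makedirs(ckpt_dir, exist_ok=True)
        self.writer = threading.Thread(target=self._writer_loop, daemon=True)
        self.writer.start()

    def step(self, iteration, compressed_grads):
        """Called once per iteration with the gradients already compressed for
        communication, so no extra compression work is done for checkpointing."""
        if iteration % self.full_interval == 0:
            # Full checkpoint: snapshot model state to CPU, persist in the background.
            state = {k: v.detach().cpu() for k, v in self.model.state_dict().items()}
            self.queue.put(("full", iteration, state))
            self.pending.clear()             # earlier differentials are now obsolete
        else:
            # Differential checkpoint: simply reuse the compressed gradients.
            self.pending.append((iteration, compressed_grads))
            if len(self.pending) >= self.batch_size:
                self.queue.put(("diff", iteration, list(self.pending)))
                self.pending.clear()

    def _writer_loop(self):
        # CPU-side background thread so persistence overlaps with GPU training.
        while True:
            kind, iteration, payload = self.queue.get()
            path = os.path.join(self.ckpt_dir, f"{kind}_{iteration}.pt")
            torch.save(payload, path)
            self.queue.task_done()
```

On recovery, one would load the most recent full checkpoint and replay the persisted differentials (applying each stored gradient through the optimizer) up to the last saved iteration. Batching the differential writes amortizes per-write overhead, which is the role the abstract attributes to LowDiff's batched gradient write optimization.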
Similar Papers
LowDiff: Efficient Diffusion Sampling with Low-Resolution Condition
CV and Pattern Recognition
Makes AI create pictures much faster.
GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training
Operating Systems
Speeds up computer learning by saving progress faster.
Replay-Based Continual Learning with Dual-Layered Distillation and a Streamlined U-Net for Efficient Text-to-Image Generation
CV and Pattern Recognition
Makes AI art generators faster and smaller.