ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
By: Tingfeng Lan, Yusen Wu, Bin Ma, and more
Potential Business Impact:
Makes AI training faster by splitting update work smartly between the GPU and CPU.
Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy.
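The abstract describes the core mechanism at a high level; the snippet below is a minimal sketch of that decoupled-update idea, not the authors' implementation. It assumes importance is approximated by gradient magnitude, and the top-k fraction `k`, the helper `decoupled_step`, and the `cpu_accum` buffer are hypothetical names introduced for illustration. A real system like ZenFlow would run the CPU path asynchronously, fully overlapped with GPU computation.

```python
# Minimal sketch (assumptions noted above, not ZenFlow's actual code):
# update the most important gradient coordinates in place on the GPU,
# and accumulate the less important ones on the CPU for a deferred update.
import torch

def decoupled_step(param: torch.nn.Parameter, cpu_accum: torch.Tensor,
                   lr: float = 1e-3, k: float = 0.01) -> torch.Tensor:
    grad = param.grad.view(-1)
    num_top = max(1, int(k * grad.numel()))

    # "Important" is approximated here by gradient magnitude (an assumption).
    top_idx = torch.topk(grad.abs(), num_top).indices
    mask = torch.zeros_like(grad, dtype=torch.bool)
    mask[top_idx] = True

    # Fast path: in-place GPU update of the important coordinates.
    flat_param = param.data.view(-1)
    flat_param[mask] -= lr * grad[mask]

    # Slow path: ship the remaining gradient mass to CPU and accumulate it.
    # A real system would make this transfer asynchronous (pinned memory,
    # non_blocking=True) and apply the accumulated update later, overlapped
    # with GPU computation, as the abstract describes.
    cpu_accum += grad.masked_fill(mask, 0.0).to("cpu", non_blocking=True)
    return cpu_accum

# Toy usage with a synthetic gradient.
device = "cuda" if torch.cuda.is_available() else "cpu"
p = torch.nn.Parameter(torch.randn(1_000, device=device))
p.grad = torch.randn_like(p)
accum = decoupled_step(p, torch.zeros(p.numel()))
```

Splitting the update this way is what lets the expensive GPU keep computing at full speed: only a small, high-impact slice of the gradient is applied synchronously, while the bulk of the (less important) update work is deferred to otherwise-idle CPU cycles.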
Similar Papers
ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory
Machine Learning (CS)
Lets huge AI models train on small computers.
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
Distributed, Parallel, and Cluster Computing
Makes AI learn faster on supercomputers.
SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
Machine Learning (CS)
Makes AI learn much faster on new chips.