Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training
By: Chang Chen, Tiancheng Chen, Jiangfei Duan, and more
Potential Business Impact:
Trains AI faster by fixing computer work jams.
Training large language models (LLMs) with increasingly long and varying sequence lengths introduces severe load imbalance challenges in large-scale data-parallel training. Recent frameworks attempt to mitigate these issues through data reorganization or hybrid parallel strategies. However, they often overlook how computational and communication costs scale with sequence length, resulting in suboptimal performance. We identify three critical challenges: (1) varying computation-to-communication ratios across sequences of different lengths in distributed attention, (2) mismatch between static NIC-GPU affinity and dynamic parallel workloads, and (3) distinct optimal partitioning strategies required for quadratic attention versus linear components. To address these challenges, we present Zeppelin, a novel training system that integrates three key techniques: (1) a hierarchical sequence partitioning method for the attention module that reduces communication overhead and balances computation, supported by an efficient attention engine that applies divergent parallel strategies; (2) a routing layer that orchestrates inter-node transfers to fully utilize NIC bandwidth; and (3) a remapping layer that transforms sequence layouts between attention and linear modules, ensuring high computational efficiency across both. Comprehensive evaluations across diverse configurations show that Zeppelin delivers an average 2.80x speedup over state-of-the-art methods.
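The abstract's third challenge, that quadratic attention and linear modules call for different partitions of the same sequences, is easy to see with a toy load balancer. The sketch below is illustrative only, not Zeppelin's algorithm: the sequence lengths, the greedy longest-processing-time heuristic, and the s² versus s cost models are all assumptions.

```python
# Illustrative sketch (assumptions throughout), NOT Zeppelin's actual method:
# balancing the same variable-length batch across data-parallel ranks under
# two cost models shows why one static partition cannot serve both modules.

from typing import Callable

def balance(seq_lens: list[int], num_ranks: int,
            cost: Callable[[int], float]) -> list[list[int]]:
    """Greedy longest-processing-time assignment: place each sequence
    (heaviest first) on the currently least-loaded rank."""
    loads = [0.0] * num_ranks
    buckets: list[list[int]] = [[] for _ in range(num_ranks)]
    for s in sorted(seq_lens, reverse=True):
        r = min(range(num_ranks), key=loads.__getitem__)  # least-loaded rank
        buckets[r].append(s)
        loads[r] += cost(s)
    return buckets

seqs = [8192, 4096, 4096, 2048, 1024, 1024, 512, 512]  # hypothetical batch

# Attention FLOPs grow roughly quadratically with sequence length,
# while linear (MLP) FLOPs grow linearly.
attn_plan = balance(seqs, num_ranks=4, cost=lambda s: float(s) * s)
mlp_plan = balance(seqs, num_ranks=4, cost=lambda s: float(s))

print(attn_plan)  # the 8192-token sequence saturates a rank by itself
print(mlp_plan)   # under linear cost, short sequences pack differently
```

Running the sketch, the quadratic plan isolates the 8192-token sequence on its own rank and piles five short sequences onto another, while the linear plan spreads the short sequences across ranks. This mismatch is what motivates the paper's remapping layer, which transforms sequence layouts between the attention and linear modules rather than forcing one partition on both.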
Similar Papers
Characterizing Communication Patterns in Distributed Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Makes AI talk faster by fixing how computers share info.
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
Distributed, Parallel, and Cluster Computing
Makes AI learn faster on supercomputers.
System-performance and cost modeling of Large Language Model training and inference
Hardware Architecture
Makes big AI models train and run cheaper.