A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems

Published: December 23, 2025 | arXiv ID: 2512.20345v1

By: Xiaoxue Ma, Wanwei Zhan, Jiale Chen and more

In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple GPUs or machines in parallel. While general-purpose frameworks (e.g., TensorFlow and PyTorch) provide distributed capabilities, these are often add-on features that demand significant manual effort for advanced parallelism, underscoring the need for specialized frameworks. This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks. We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns. Crucially, we establish explicit mappings between symptoms, causes, and fixes across distributed training stages, enabling a systematic understanding of how issues emerge and are resolved. Our results show that 45.1% of bug symptoms are unique to distributed frameworks, with setup failures, memory issues, and performance anomalies being the most prevalent. Moreover, 95% of issues in the communication setup stage occur exclusively in distributed contexts. We also find that over 60% of cases can be resolved through version and dependency management, and through distributed feature, API, and communication tuning. Based on these findings, we provide actionable implications.
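To make the "communication setup stage" concrete, the sketch below shows a minimal process-group initialization using PyTorch's torch.distributed API. It is illustrative only and not drawn from the paper's artifacts; the environment-variable names assume a torchrun-style launcher, and the backend choice is one common convention rather than a prescription from the study.

```python
# Minimal sketch of the communication setup stage in distributed training.
# Not from the paper: it only illustrates where setup-stage failures
# (missing launcher variables, backend/version mismatches) tend to surface.
import os

import torch
import torch.distributed as dist


def init_distributed():
    # These variables are typically set by a launcher such as torchrun;
    # missing or inconsistent values are a common setup failure.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL is the usual backend for multi-GPU training; falling back to gloo
    # on CPU-only machines avoids one class of backend-mismatch errors.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return rank, world_size


if __name__ == "__main__":
    rank, world_size = init_distributed()
    print(f"rank {rank}/{world_size} initialized")
    dist.destroy_process_group()
```

Run under `torchrun --nproc_per_node=<N> script.py`, this is the kind of boilerplate that specialized frameworks such as DeepSpeed wrap for the user, which is also where many of the setup-stage issues the study catalogs originate.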

Category
Computer Science:
Software Engineering