A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems
By: Xiaoxue Ma, Wanwei Zhan, Jiale Chen, and more
In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple GPUs or machines in parallel. While general-purpose frameworks (e.g., TensorFlow and PyTorch) provide distributed capabilities, these are often add-on features that demand significant manual effort for advanced parallelism, underscoring the need for specialized frameworks. This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks. We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns. Crucially, we establish explicit mappings between symptoms, causes, and fixes across distributed training stages, enabling a systematic understanding of how issues emerge and are resolved. Our results show that 45.1% of bug symptoms are unique to distributed frameworks, with setup failures, memory issues, and performance anomalies being the most prevalent. Moreover, 95% of issues in the communication setup stage occur exclusively in distributed contexts. We also find that over 60% of cases can be resolved through version and dependency management and through tuning of distributed features, APIs, and communication settings. Based on these findings, we provide actionable implications.
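To make the communication setup stage concrete, the sketch below shows how a distributed process group is typically initialized. This is a minimal, illustrative example assuming PyTorch's torch.distributed API (the layer that DeepSpeed, Megatron-LM, and Colossal-AI build on), not code taken from the study; misconfigured ranks, world sizes, or backends at this step are exactly the kind of distributed-only setup failures the abstract highlights.

# Minimal sketch of the communication-setup stage, assuming PyTorch's
# torch.distributed API. Illustrative only; not drawn from the paper.
import os

import torch
import torch.distributed as dist


def init_distributed():
    """Initialize the default process group from launcher-provided env vars."""
    # RANK / WORLD_SIZE / LOCAL_RANK are normally injected by a launcher such
    # as torchrun; missing or inconsistent values are a classic source of
    # distributed-only setup failures.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # Pick NCCL for GPU collectives and fall back to Gloo on CPU; a mismatch
    # between backend and device placement is another common setup symptom.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return rank, world_size


if __name__ == "__main__":
    rank, world_size = init_distributed()
    print(f"rank {rank}/{world_size} initialized")
    dist.destroy_process_group()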