ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads
By: Xiaokai Wang, Shaoyuan Huang, Yuting Li and more
Potential Business Impact:
Predicts AI program speed accurately, saving time and money.
Deep neural networks (DNNs) form the cornerstone of modern AI services, supporting a wide range of applications, including autonomous driving, chatbots, and recommendation systems. As models grow in size and complexity, DNN workloads such as training and inference impose unprecedented demands on distributed computing resources, making accurate runtime prediction essential for optimizing development and resource allocation. Traditional methods rely on additive computational unit models, which limits their accuracy and generalizability. Graph-enhanced modeling, in contrast, improves prediction performance but significantly increases data collection costs. There is therefore a critical need for a method that balances accuracy, generalizability, and data collection cost. To address these challenges, we propose ScaleDL, a novel runtime prediction framework that combines nonlinear layer-wise modeling with a graph neural network (GNN)-based cross-layer interaction mechanism, enabling accurate DNN runtime prediction and hierarchical generalizability across different network architectures. Additionally, we employ the D-optimal method to reduce data collection costs. Experiments on the workloads of five popular DNN models show that ScaleDL improves runtime prediction accuracy and generalizability, achieving 6$\times$ lower MRE and 5$\times$ lower RMSE compared to baseline models.
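The abstract reports results in terms of MRE and RMSE without defining them. The sketch below assumes the common definitions (mean relative error and root mean squared error between predicted and measured runtimes); the paper itself may normalize differently. The function names and the runtime values are hypothetical and chosen only for illustration.

```python
import math


def mean_relative_error(predicted, measured):
    """Assumed MRE: mean of |predicted - measured| / measured."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)


def root_mean_squared_error(predicted, measured):
    """Assumed RMSE: square root of the mean squared prediction error."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(squared / len(measured))


# Hypothetical per-iteration runtimes in milliseconds (not from the paper).
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

mre = mean_relative_error(predicted, measured)
rmse = root_mean_squared_error(predicted, measured)
print(f"MRE = {mre:.3f}, RMSE = {rmse:.3f} ms")
```

MRE is scale-free, so it weights a 10 ms error on a 100 ms job more heavily than the same error on a 400 ms job, while RMSE is in the unit of the runtime and is dominated by the largest absolute errors. Reporting both, as the abstract does, covers short and long workloads.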