ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads

Published: November 6, 2025 | arXiv ID: 2511.04162v1

By: Xiaokai Wang, Shaoyuan Huang, Yuting Li, and more

Potential Business Impact:

Accurately predicts how long AI training and inference jobs will run, saving time and money.

Business Areas:
Predictive Analytics, Artificial Intelligence, Data and Analytics, Software

Deep neural networks (DNNs) form the cornerstone of modern AI services, supporting a wide range of applications, including autonomous driving, chatbots, and recommendation systems. As models grow in size and complexity, DNN workloads such as training and inference impose unprecedented demands on distributed computing resources, making accurate runtime prediction essential for optimizing development and resource allocation. Traditional methods rely on additive computational-unit models, which limits their accuracy and generalizability. Graph-enhanced modeling improves performance but significantly increases data collection costs. There is therefore a critical need for a method that balances accuracy, generalizability, and data collection cost. To address these challenges, we propose ScaleDL, a novel runtime prediction framework that combines nonlinear layer-wise modeling with a graph neural network (GNN)-based cross-layer interaction mechanism, enabling accurate DNN runtime prediction and hierarchical generalizability across different network architectures. Additionally, we employ the D-optimal method to reduce data collection costs. Experiments on the workloads of five popular DNN models show that ScaleDL improves runtime prediction accuracy and generalizability, achieving 6× lower mean relative error (MRE) and 5× lower root-mean-square error (RMSE) than baseline models.
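To make the modeling idea concrete, here is a minimal, hedged sketch of a layer-graph runtime predictor in the spirit of ScaleDL: each DNN layer becomes a graph node carrying per-layer features, edges follow the dataflow, and one round of message passing captures cross-layer interactions before per-layer runtime contributions are summed. This is not the authors' implementation; the feature choices (FLOPs, parameters, activation size), dimensions, and single-step propagation are illustrative assumptions.

```python
# Illustrative sketch only, not the paper's code. Layer features and
# dimensions are assumptions chosen to keep the example self-contained.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class LayerGraphRuntimeModel:
    def __init__(self, feat_dim, hidden_dim=16):
        self.W_in = rng.normal(scale=0.1, size=(feat_dim, hidden_dim))
        self.W_msg = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        self.w_out = rng.normal(scale=0.1, size=(hidden_dim,))

    def predict(self, features, adjacency):
        # Nonlinear per-layer embedding (the "layer-wise model").
        h = relu(features @ self.W_in)
        # One round of message passing over the layer graph
        # (the "cross-layer interaction mechanism").
        h = relu(h + adjacency @ h @ self.W_msg)
        # Per-layer runtime contributions, summed into a total runtime.
        return float(relu(h @ self.w_out).sum())

# Toy 3-layer chain, features = [GFLOPs, Mparams, MB of activations].
features = np.array([[1.2, 0.1, 4.0],
                     [2.4, 0.4, 2.0],
                     [0.3, 1.0, 0.1]])
adjacency = np.array([[0., 1., 0.],
                      [0., 0., 1.],
                      [0., 0., 0.]])
model = LayerGraphRuntimeModel(feat_dim=3)
print(f"predicted runtime: {model.predict(features, adjacency):.3f} (untrained, illustrative)")
```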
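The D-optimal data-collection step can likewise be sketched: from a pool of candidate workload configurations, greedily select the subset whose feature matrix X maximizes det(XᵀX), so each profiled run adds the most information per unit of profiling cost. The design space below (batch size, sequence length, GPU count) and the greedy Fedorov-style selection are assumptions for illustration, not necessarily the paper's exact procedure.

```python
# Hedged sketch of greedy D-optimal design over a hypothetical config space.
import numpy as np
from itertools import product

# Candidate pool: every combination of batch size, sequence length, GPU count.
pool = np.array(list(product([8, 16, 32, 64],    # batch size
                             [128, 256, 512],    # sequence length
                             [1, 2, 4, 8])),     # number of GPUs
                dtype=float)
X_pool = np.column_stack([np.ones(len(pool)), pool])  # add intercept column

def greedy_d_optimal(X, budget):
    """Greedily add the row that most increases log det(X_S^T X_S + eps*I)."""
    selected, remaining, eps = [], list(range(len(X))), 1e-6
    for _ in range(budget):
        best_row, best_logdet = None, -np.inf
        for i in remaining:
            trial = X[selected + [i]]
            sign, logdet = np.linalg.slogdet(trial.T @ trial + eps * np.eye(X.shape[1]))
            if sign > 0 and logdet > best_logdet:
                best_row, best_logdet = i, logdet
        selected.append(best_row)
        remaining.remove(best_row)
    return selected

chosen = greedy_d_optimal(X_pool, budget=8)
print("configurations to profile:")
for i in chosen:
    print(f"  batch={pool[i, 0]:.0f}  seq_len={pool[i, 1]:.0f}  gpus={pool[i, 2]:.0f}")
```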
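For reference, the reported metrics, mean relative error (MRE) and root-mean-square error (RMSE), under their standard definitions (the paper's exact formulas may differ slightly):

```python
# Standard definitions of the two reported metrics; illustrative only.
import numpy as np

def mre(y_true, y_pred):
    """Mean relative error: average of |prediction - truth| / |truth|."""
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)))

def rmse(y_true, y_pred):
    """Root-mean-square error of the runtime predictions."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```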

Country of Origin
🇨🇳 China

Page Count
6 pages

Category
Computer Science:
Machine Learning (CS)