Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters
By: Abeda Sultana, Nabin Pakka, Fei Xu, and more
Potential Business Impact:
Speeds up deep learning training by making better use of mixed GPU clusters.
Scheduling deep learning (DL) models to train on powerful clusters with accelerators such as GPUs and TPUs presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources substantially under-utilized. To fill this gap, we propose a novel task-level heterogeneity-aware scheduler, Hadar, based on an optimization framework that can boost resource utilization. Hadar leverages the performance traits of DL jobs on a heterogeneous DL cluster, characterizes task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. It employs a primal-dual framework with a dual subroutine to solve the optimization problem and guide the scheduling design. Our trace-driven simulation with representative DL model training workloads demonstrates that Hadar shortens the total time duration by 1.20x compared with its state-of-the-art heterogeneity-aware counterpart, Gavel. Further, our Hadar scheduler is enhanced to HadarE by forking each job into multiple copies, letting a job train concurrently on heterogeneous GPUs residing on separate available nodes (i.e., machines or servers) to enhance resource utilization. HadarE is evaluated extensively on physical DL clusters for comparison with Hadar and Gavel. With a substantial improvement in cluster resource utilization (by 1.45x), HadarE achieves considerable speed-ups in DL model training, reducing the total time duration by 50% (or 80%) on an Amazon AWS (or our lab) cluster, while producing trained DL models with consistently better inference quality than those trained by Hadar.
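To make the idea of heterogeneity-aware, price-guided scheduling concrete, here is a minimal toy sketch in Python of the general primal-dual pattern the abstract alludes to. It is not the Hadar algorithm: the job names, throughput numbers, price-update rule, and step size `eta` are all hypothetical, chosen only to illustrate how per-GPU-type performance and dual prices can jointly drive allocation decisions round by round.

```python
# Illustrative sketch only: a toy heterogeneity-aware scheduling loop in the
# primal-dual spirit described in the abstract. All names, numbers, and the
# price-update rule are hypothetical and NOT taken from the Hadar paper.

# Per-job training throughput (e.g., iterations/hour) on each GPU type --
# the "task-level performance heterogeneity" a scheduler can exploit.
jobs = {
    "job_a": {"V100": 10.0, "K80": 3.0},
    "job_b": {"V100": 6.0,  "K80": 5.0},
    "job_c": {"V100": 9.0,  "K80": 2.0},
}

capacity = {"V100": 1, "K80": 2}    # GPUs of each type in the cluster
price = {g: 0.0 for g in capacity}  # dual prices, one per GPU type
eta = 0.5                           # hypothetical price step size

for round_id in range(3):           # scheduling rounds (temporal dimension)
    free = dict(capacity)
    allocation = {}
    # Dual-subroutine analogue: each job bids for the GPU type with the
    # largest throughput net of its current price, among types still free.
    for job, speeds in jobs.items():
        avail = [g for g in speeds if free[g] > 0]
        if not avail:
            continue
        best = max(avail, key=lambda g: speeds[g] - price[g])
        if speeds[best] - price[best] > 0:
            allocation[job] = best
            free[best] -= 1
    # Price update: contended GPU types become more expensive, idle ones
    # cheaper, steering later rounds toward under-utilized hardware.
    for g in capacity:
        used = capacity[g] - free[g]
        price[g] = max(0.0, price[g] + eta * (used - free[g]))
    print(f"round {round_id}: {allocation}, prices={price}")
```

Running the sketch shows fast jobs claiming the scarce V100 while others fall back to K80s, with prices rising on contended types; the real scheduler solves a formal optimization with guarantees, but the allocate-then-reprice loop conveys the intuition.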
Similar Papers
Semantic-Aware Scheduling for GPU Clusters with Large Language Models
Machine Learning (CS)
Makes computer jobs finish much faster.
Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters
Distributed, Parallel, and Cluster Computing
Makes computer jobs run faster and use less power.
Puzzle: Scheduling Multiple Deep Learning Models on Mobile Device with Heterogeneous Processors
Machine Learning (CS)
Lets phones run many AI tasks faster.