Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters
By: Abeda Sultana, Nabin Pakka, Fei Xu, and more
Potential Business Impact:
Speeds up deep learning training by making better use of mixed GPU clusters.
Scheduling deep learning (DL) models to train on powerful clusters with accelerators such as GPUs and TPUs presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources substantially under-utilized. To fill this gap, we propose a novel task-level heterogeneity-aware scheduler, Hadar, based on an optimization framework that can boost resource utilization. Hadar leverages the performance traits of DL jobs on a heterogeneous DL cluster, characterizes task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. It employs a primal-dual framework with a dual subroutine to solve the optimization problem and guide the scheduling design. Our trace-driven simulation with representative DL model training workloads demonstrates that Hadar shortens the total time duration by 1.20x compared with its state-of-the-art heterogeneity-aware counterpart, Gavel. Further, our Hadar scheduler is enhanced to HadarE by forking each job into multiple copies, letting a job train concurrently on heterogeneous GPUs residing on separate available nodes (i.e., machines or servers) to enhance resource utilization. HadarE is evaluated extensively on physical DL clusters for comparison with Hadar and Gavel. With a substantial improvement in cluster resource utilization (by 1.45x), HadarE achieves considerable speed-ups in DL model training, reducing the total time duration by 50% (or 80%) on an Amazon AWS (or our lab) cluster, while producing trained DL models with consistently better inference quality than those trained by Hadar.
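To make the idea of heterogeneity-aware, price-guided scheduling concrete, here is a minimal toy sketch in Python of the general primal-dual pattern the abstract alludes to. It is not the Hadar algorithm: the job names, throughput numbers, price-update rule, and step size `eta` are all hypothetical, chosen only to illustrate how per-GPU-type performance and dual prices can jointly drive allocation decisions round by round.

```python
# Illustrative sketch only: a toy heterogeneity-aware scheduling loop in the
# primal-dual spirit described in the abstract. All names, numbers, and the
# price-update rule are hypothetical and NOT taken from the Hadar paper.

# Per-job training throughput (e.g., iterations/hour) on each GPU type --
# the "task-level performance heterogeneity" a scheduler can exploit.
jobs = {
    "job_a": {"V100": 10.0, "K80": 3.0},
    "job_b": {"V100": 6.0,  "K80": 5.0},
    "job_c": {"V100": 9.0,  "K80": 2.0},
}

capacity = {"V100": 1, "K80": 2}    # GPUs of each type in the cluster
price = {g: 0.0 for g in capacity}  # dual prices, one per GPU type
eta = 0.5                           # hypothetical price step size

for round_id in range(3):           # scheduling rounds (temporal dimension)
    free = dict(capacity)
    allocation = {}
    # Dual-subroutine analogue: each job bids for the GPU type with the
    # largest throughput net of its current price, among types still free.
    for job, speeds in jobs.items():
        avail = [g for g in speeds if free[g] > 0]
        if not avail:
            continue
        best = max(avail, key=lambda g: speeds[g] - price[g])
        if speeds[best] - price[best] > 0:
            allocation[job] = best
            free[best] -= 1
    # Price update: contended GPU types become more expensive, idle ones
    # cheaper, steering later rounds toward under-utilized hardware.
    for g in capacity:
        used = capacity[g] - free[g]
        price[g] = max(0.0, price[g] + eta * (used - free[g]))
    print(f"round {round_id}: {allocation}, prices={price}")
```

Running the sketch shows fast jobs claiming the scarce V100 while others fall back to K80s, with prices rising on contended types; the real scheduler solves a formal optimization with guarantees, but the allocate-then-reprice loop conveys the intuition.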
Similar Papers
Semantic-Aware Scheduling for GPU Clusters with Large Language Models
Machine Learning (CS)
Makes computer jobs finish much faster.
Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters
Distributed, Parallel, and Cluster Computing
Makes computer jobs run faster and use less power.
Puzzle: Scheduling Multiple Deep Learning Models on Mobile Device with Heterogeneous Processors
Machine Learning (CS)
Lets phones run many AI tasks faster.