Metronome: Efficient Scheduling for Periodic Traffic Jobs with Network and Priority Awareness
By: Hao Jiang, Meng Qin, Ruijie Kuai and more
Potential Business Impact:
Lets jobs in a shared computing cluster take turns on the network, so they finish sooner and use more of the available bandwidth.
With the rapid growth in demand for computing power, cloud native networks have emerged as a promising way to coordinate resources efficiently, particularly in coping with dynamic fluctuations of network bandwidth in clusters. We propose Metronome, a network-aware and priority-aware scheduling mechanism for cloud native networks. It is designed to support jobs that exhibit periodic traffic patterns and dynamic bandwidth demands, particularly distributed training jobs. Specifically, Metronome employs a time-division multiplexing approach that leverages job traffic characteristics to construct an elastic network resource allocation model, enabling efficient bandwidth sharing across multiple jobs. In addition, it incorporates a multi-objective optimization strategy that jointly considers latency and job priorities to achieve resource allocation that is both globally optimal and dynamic. Finally, Metronome adapts to the dynamic environment by monitoring the cluster and performing reconfiguration operations. Extensive experiments with 13 common machine learning models demonstrate that Metronome can enhance cluster resource utilization while guaranteeing service performance. Compared with the existing Kubernetes scheduling mechanisms across multiple scenarios, Metronome reduces job completion time by up to 19.50% while improving average bandwidth utilization by up to 23.20%.
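To make the time-division multiplexing idea concrete, here is a minimal Python sketch of how periodic communication bursts from several jobs might be interleaved on one shared link. It is not Metronome's algorithm, which the abstract does not specify. The `Job` fields, the slot-based time model, the greedy priority-first search, and the names `load_profile` and `assign_offsets` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import lcm  # multi-argument lcm needs Python 3.9+


@dataclass
class Job:
    # All fields are hypothetical; the abstract does not define a job model.
    name: str
    period: int      # slots per iteration (compute + communication)
    burst: int       # consecutive communication slots per iteration (<= period)
    demand: float    # bandwidth needed during a burst, e.g. in Gbps
    priority: int    # larger value = placed first


def load_profile(job, offset, horizon):
    """Bandwidth the job puts on the link in each slot of the horizon."""
    profile = [0.0] * horizon
    for start in range(offset, horizon + offset, job.period):
        for k in range(job.burst):
            profile[(start + k) % horizon] += job.demand
    return profile


def assign_offsets(jobs, capacity):
    """Greedily pick a start offset per job, highest priority first.

    For each job, choose the offset that adds the least traffic above the
    link capacity, breaking ties by lowest peak load, then smallest offset.
    """
    horizon = lcm(*(j.period for j in jobs))  # pattern repeats after this
    link = [0.0] * horizon
    offsets = {}
    for job in sorted(jobs, key=lambda j: -j.priority):
        best = None
        for offset in range(job.period):
            prof = load_profile(job, offset, horizon)
            total = [l + p for l, p in zip(link, prof)]
            overload = sum(max(0.0, t - capacity) for t in total)
            key = (overload, max(total), offset)
            if best is None or key < best[0]:
                best = (key, offset, total)
        _, offsets[job.name], link = best
    return offsets, link


jobs = [
    Job("a", period=4, burst=2, demand=10.0, priority=2),
    Job("b", period=4, burst=2, demand=10.0, priority=1),
]
offsets, link = assign_offsets(jobs, capacity=10.0)
print(offsets, link)
```

In this toy case both jobs need the full 10 Gbps link for half of each iteration, so shifting the lower-priority job by two slots lets them alternate without ever exceeding capacity. A greedy search like this gives no global optimality guarantee. The abstract describes a multi-objective optimization over latency and priorities instead.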
Similar Papers
Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads
Distributed, Parallel, and Cluster Computing
Makes computer jobs run faster by predicting delays.
QoS-aware Scheduling of Periodic Real-time Task Graphs on Heterogeneous Pre-occupied MECs
Distributed, Parallel, and Cluster Computing
Makes phones run apps faster, even when busy.
Optimal Multi-Constrained Workflow Scheduling for Cyber-Physical Systems in the Edge-Cloud Continuum
Networking and Internet Architecture
Makes smart devices work faster together.