GFS: A Preemption-aware Scheduling Framework for GPU Clusters with Predictive Spot Instance Management
By: Jiaang Duan, Shenglin Xu, Shiyou Qian and more
Potential Business Impact:
Saves money by better sharing computer power.
The surge in large language models (LLMs) has fundamentally reshaped GPU usage patterns, creating an urgent need for more efficient management strategies. While cloud providers employ spot instances to reduce costs for low-priority (LP) tasks, existing schedulers still suffer from high eviction rates and lengthy queuing times. To address these limitations, we present GFS, a novel preemptive scheduling framework that enhances service-level objective (SLO) compliance for high-priority (HP) tasks while minimizing preemptions of LP tasks. First, GFS utilizes a lightweight forecasting model that predicts GPU demand among different tenants, enabling proactive resource management. Second, GFS employs a dynamic allocation mechanism to adjust the spot quota for LP tasks with guaranteed durations. Finally, GFS incorporates a preemptive scheduling policy that prioritizes HP tasks while minimizing the impact on LP tasks. We demonstrate the effectiveness of GFS through both real-world implementation and simulations. The results show that GFS reduces eviction rates by 33.0% and cuts queuing delays by 44.1% for LP tasks. Furthermore, GFS enhances the GPU allocation rate by up to 22.8% in real production clusters. In a production cluster of more than 10,000 GPUs, GFS yields roughly $459,715 in monthly benefits.
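To make the interplay between the spot quota and the preemption policy concrete, here is a minimal Python sketch. It is not the GFS algorithm, which the abstract does not specify. It assumes the spot quota is simply the capacity left after the forecast HP demand and a safety margin, and that an HP task preempts LP tasks largest first so that as few LP tasks as possible are evicted. The names LPTask, spot_quota, pick_victims and safety_margin are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LPTask:
    """A running low-priority (spot) task; hypothetical type for illustration."""
    name: str
    gpus: int


def spot_quota(capacity: int, forecast_hp_demand: int, safety_margin: int = 0) -> int:
    """GPUs that may be lent to LP tasks under the assumed rule:
    capacity minus forecast HP demand minus a safety margin, never negative."""
    return max(0, capacity - forecast_hp_demand - safety_margin)


def pick_victims(needed: int, free: int, running_lp: List[LPTask]) -> Optional[List[LPTask]]:
    """Choose LP tasks to evict so an HP task needing `needed` GPUs can start.

    Returns [] when free GPUs already suffice and None when evicting every
    LP task would still not be enough. Evicting the largest tasks first
    minimizes the number of evicted tasks (not the number of evicted GPUs).
    """
    shortfall = needed - free
    if shortfall <= 0:
        return []
    victims: List[LPTask] = []
    for task in sorted(running_lp, key=lambda t: t.gpus, reverse=True):
        victims.append(task)
        shortfall -= task.gpus
        if shortfall <= 0:
            return victims
    return None
```

Under these assumptions, a more accurate demand forecast lets the quota stay close to the true idle capacity, so pick_victims is rarely called with a positive shortfall. A production policy would also have to weigh guaranteed durations and how much work each evicted task loses, which this sketch ignores.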
Similar Papers
FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs
Distributed, Parallel, and Cluster Computing
Saves energy and pollution by smart computer use.
GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
Performance
Saves energy when computers think big thoughts.