PROSERVE: Unified Multi-Priority Request Scheduling for LLM Serving
By: Weizhe Huang, Tao Peng, Tongxuan Liu, et al.
The widespread deployment of large language models (LLMs) for interactive applications necessitates serving systems that can handle thousands of concurrent requests with diverse Service Level Objective (SLO) requirements. A critical yet often overlooked dimension in this context is the inherent priority difference among clients; for instance, business-critical functions demand stronger performance guarantees, as fulfilling such requests yields significantly greater business value. However, existing LLM serving schedulers fail to jointly optimize for both SLO attainment and client-level priorities. To bridge this gap, we first formalize multi-priority request scheduling as a service gain maximization problem, where satisfying the latency requirements of requests at different priorities contributes varying levels of gain. We then propose PROSERVE, a unified two-tier scheduling framework designed to maximize overall service gain. At the engine level, SlideBatching dynamically adapts batch formation and request ordering under varying load conditions, employing a sliding boundary mechanism to balance deadline-first and density-first strategies. At the service level, GoRouting performs gain-oriented and capability-aware dispatching across distributed instances, proactively reserving capacity for future high-priority or long requests. Extensive evaluation across four open-source datasets and a real-world industrial trace demonstrates that PROSERVE consistently outperforms state-of-the-art baselines, improving system gain by up to 35% and boosting SLO attainment by up to 52%.
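The abstract compresses both the objective and the engine-level mechanism into a few clauses, so a minimal sketch may help make them concrete. The Python below is a reconstruction from the abstract alone, not the paper's actual algorithm: the names Request, service_gain, slide_order, and the boundary parameter are all hypothetical, and the real SlideBatching policy likely differs in detail.

```python
# Illustrative sketch only, reconstructed from the abstract's description.
# All names here are hypothetical; the paper's formulation may differ.
from dataclasses import dataclass

@dataclass
class Request:
    deadline: float        # absolute SLO deadline (seconds)
    priority_gain: float   # gain earned iff the SLO is met
    remaining_work: float  # estimated remaining service time (seconds)

def service_gain(finish_times, requests):
    """Total service gain: each request contributes its priority-dependent
    gain only if it finishes by its deadline (the maximization objective
    described in the abstract)."""
    return sum(r.priority_gain
               for r, t in zip(requests, finish_times)
               if t <= r.deadline)

def slide_order(queue, boundary):
    """Order the batch queue with a sliding boundary in [0, 1].

    boundary = 0 -> pure deadline-first (earliest-deadline-first, sensible
    under light load); boundary = 1 -> pure density-first (gain per unit of
    remaining work, sensible under overload when some SLOs must be dropped).
    """
    k = int(len(queue) * (1.0 - boundary))
    by_deadline = sorted(queue, key=lambda r: r.deadline)
    head, tail = by_deadline[:k], by_deadline[k:]
    # Requests past the boundary compete by gain density instead.
    tail.sort(key=lambda r: r.priority_gain / max(r.remaining_work, 1e-9),
              reverse=True)
    return head + tail
```

Under this reading, the engine would slide boundary toward 1 as load rises, trading strict deadline ordering for gain density; how PROSERVE actually sets the boundary is not specified in the abstract.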
Similar Papers
SLOs-Serve: Optimized Serving of Multi-SLO LLMs
Distributed, Parallel, and Cluster Computing
Tempo: Application-aware LLM Serving with Mixed SLO Requirements
Distributed, Parallel, and Cluster Computing
SLO-Aware Scheduling for Large Language Model Inferences
Distributed, Parallel, and Cluster Computing