HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
By: Zahra Yousefijamarani, Xinglu Wang, Qian Wang, and more
Potential Business Impact:
Makes AI answer questions faster and cheaper.
Modern large language model (LLM) serving systems face highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). Meeting these objectives requires real-time scheduling, rapid and cost-effective scaling, and support for both collocated and disaggregated Prefill/Decode (P/D) architectures. We present HyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs. It features a multi-SLO-aware scheduler that leverages budget estimation and request prioritization to ensure proactive SLO compliance for both new and ongoing requests. The system supports prefill- and decode-stage multi-SLO scheduling for P/D-disaggregated architectures and KV cache transfers. It also enables cost-effective scaling decisions, prefill-decode instance linking during scaling, and rapid P/D role transitions. To accelerate scaling and reduce cold-start latency, we propose a device-to-device (D2D) weight transfer mechanism that lowers weight-loading overhead by up to 19.39×. Together, these optimizations achieve up to 4.44× higher SLO attainment, 65.82% lower request latency, and cost parity with state-of-the-art baselines. The code will be released soon.
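To make the scheduling idea concrete, below is a minimal, hypothetical Python sketch of multi-SLO-aware scheduling via budget estimation and least-slack-first prioritization. It is not the released HyperFlexis code: the class names, throughput constants, and linear service-time model are illustrative assumptions only.

import heapq
import time
from dataclasses import dataclass, field
from typing import Optional

PREFILL_TOKENS_PER_SEC = 8000.0   # assumed prefill throughput used for budget estimation
DECODE_TOKENS_PER_SEC = 60.0      # assumed per-request decode rate

@dataclass(order=True)
class Request:
    slack: float = field(init=False)              # sort key: remaining budget minus estimated work
    arrival: float = field(compare=False)
    slo_latency: float = field(compare=False)     # end-to-end SLO in seconds
    prompt_tokens: int = field(compare=False)
    max_new_tokens: int = field(compare=False)

    def __post_init__(self) -> None:
        self.slack = self.remaining_budget(time.time()) - self.estimated_service_time()

    def estimated_service_time(self) -> float:
        # Crude budget estimate: prefill cost grows with prompt length,
        # decode cost with the number of tokens still to generate.
        return (self.prompt_tokens / PREFILL_TOKENS_PER_SEC
                + self.max_new_tokens / DECODE_TOKENS_PER_SEC)

    def remaining_budget(self, now: float) -> float:
        return self.slo_latency - (now - self.arrival)

class MultiSLOScheduler:
    """Least-slack-first admission: the request whose SLO is tightest
    relative to its estimated remaining work is scheduled first."""

    def __init__(self) -> None:
        self._queue: list[Request] = []

    def submit(self, req: Request) -> None:
        heapq.heappush(self._queue, req)

    def next_request(self) -> Optional[Request]:
        now = time.time()
        while self._queue:
            req = heapq.heappop(self._queue)
            if req.remaining_budget(now) > 0:
                return req
            # Budget already exhausted: SLO missed, route to a best-effort pool instead.
        return None

# Example: a short, tight-SLO request is served before a long, loose-SLO one.
sched = MultiSLOScheduler()
now = time.time()
sched.submit(Request(arrival=now, slo_latency=30.0, prompt_tokens=4000, max_new_tokens=512))
sched.submit(Request(arrival=now, slo_latency=2.0, prompt_tokens=200, max_new_tokens=64))
print(sched.next_request().slo_latency)   # -> 2.0

Under this ordering, a short request with a tight SLO is admitted ahead of a long request with a loose one, which is the behavior a multi-SLO scheduler needs to keep both kinds of requests within budget.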
Similar Papers
DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving
Distributed, Parallel, and Cluster Computing
Makes AI talk faster and handle more requests.
OOCO: Latency-disaggregated Architecture for Online-Offline Co-locate LLM Serving
Distributed, Parallel, and Cluster Computing
Makes AI answer questions faster and cheaper.
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Distributed, Parallel, and Cluster Computing
Makes AI models run faster and cheaper.