DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving
By: Chaoyi Ruan, Yinhe Chen, Dongqi Tian, and more
Potential Business Impact:
Makes AI talk faster and handle more requests.
LLM inference must meet strict latency SLOs (e.g., 100 ms P99 time-between-tokens) while maximizing goodput. Yet real-world variability in prompt and response lengths skews the balance between compute-intensive prefill and memory-bound decode phases, so neither colocated deployments (even with chunked prefill) nor disaggregated ones can deliver low tail latency and high throughput at the same time. We introduce DynaServe, a high-performance LLM serving system built atop vLLM that unifies and extends both paradigms to maximize goodput under SLO constraints on unbalanced and dynamic workloads. It relies on a micro-request abstraction, which splits each request at an arbitrary token boundary into at most two cooperating segments. A two-level scheduling framework then balances micro-request load across unified GPU instances: the global scheduler rapidly selects per-request split points by considering both a request's prefill/decode time ratio and the current load across GPU instances, while the local scheduler on each GPU instance independently forms SLO-aware batches, adjusting their composition in response to workload fluctuations, potential latency spikes, and per-GPU under- or over-utilization. On real-world traces, DynaServe boosts overall serving capacity by 1.15× to 3.07×, improves goodput by up to 1.91× and 1.61×, and improves performance by up to 60% on a hybrid workload under SLO, relative to state-of-the-art colocated and disaggregated baselines.
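The core scheduling idea lends itself to a short sketch: pick a per-request split point so that the two cooperating GPU instances end up with balanced load. Below is a minimal, hypothetical Python illustration of that split-point selection; the `Instance` type, `pick_split_point`, and the linear per-token prefill cost model are assumptions made here for clarity, not the paper's implementation, which also weighs SLO slack and batch composition.

```python
# Hypothetical sketch of DynaServe-style split-point selection (not the
# authors' code): split a request's prefill work across two GPU instances
# so that, after the split, the instances' total loads are balanced.
from dataclasses import dataclass

@dataclass
class Instance:
    load_ms: float  # current estimated backlog on this instance, in milliseconds

def pick_split_point(prompt_len: int,
                     prefill_ms_per_token: float,
                     decode_ms: float,
                     inst_a: Instance,
                     inst_b: Instance) -> int:
    """Return the token index at which instance A hands the request to B.

    Instance A prefills tokens [0, split); instance B prefills
    [split, prompt_len) and runs the whole decode phase. We choose `split`
    to equalize the two instances' resulting loads, then clamp it to a
    valid token boundary, so split = prompt_len degenerates to classic
    disaggregation (A prefills, B decodes) and split = 0 to colocation on B.
    """
    c = prefill_ms_per_token
    total_prefill_ms = prompt_len * c
    # Balance condition:
    #   inst_a.load + split*c == inst_b.load + (prompt_len - split)*c + decode_ms
    # Solving for split:
    split = (inst_b.load_ms - inst_a.load_ms + total_prefill_ms + decode_ms) / (2 * c)
    return max(0, min(prompt_len, round(split)))

# Example: on an idle pair of instances, a long-prompt request shifts most
# of its prefill onto instance A, since B also absorbs the decode phase.
print(pick_split_point(prompt_len=4096, prefill_ms_per_token=0.05,
                       decode_ms=80.0, inst_a=Instance(0.0), inst_b=Instance(0.0)))
```

Note how this sketch recovers both classic paradigms as endpoints: split = prompt_len is pure prefill/decode disaggregation, split = 0 is colocation on a single instance, and intermediate values interpolate between the two based on current load.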
Similar Papers
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
Machine Learning (CS)
Makes AI models answer questions much faster.
GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving
Performance
Saves energy when computers think big thoughts.
HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
Distributed, Parallel, and Cluster Computing
Makes AI answer questions faster and cheaper.