Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
By: Rongzhi Li, Ruogu Du, Zefang Chu, and more
Potential Business Impact:
Makes serving AI models faster and cheaper.
Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven effective, increasing average GPU utilization by 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
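To make the single-metric idea concrete, below is a minimal Python sketch of what a joint P/D scaling rule could look like. The abstract does not name the metric or the pool ratio, so the decode-throughput signal, the `target_tps_per_decode` threshold, and the `prefill_per_decode` ratio are illustrative assumptions, not HeteroScale's actual policy; the point is that deriving both pool sizes from one signal keeps the stages balanced by construction, whereas scaling them from independent metrics lets them drift apart.

```python
# Minimal sketch of a metric-driven joint P/D autoscaling policy.
# Assumptions (not from the paper): a decode-throughput signal drives
# scaling, and a fixed prefill:decode replica ratio preserves balance.

from dataclasses import dataclass
from math import ceil


@dataclass
class PoolSizes:
    prefill: int
    decode: int


def plan_scale(
    decode_tokens_per_s: float,    # observed aggregate decode throughput
    target_tps_per_decode: float,  # desired load per decode replica (assumed SLO-derived)
    prefill_per_decode: float,     # architectural P:D ratio to preserve (assumed)
    min_decode: int = 1,
) -> PoolSizes:
    """Derive both pool sizes from one signal so the P/D stages stay balanced."""
    decode = max(min_decode, ceil(decode_tokens_per_s / target_tps_per_decode))
    prefill = max(1, ceil(decode * prefill_per_decode))
    return PoolSizes(prefill=prefill, decode=decode)


if __name__ == "__main__":
    # Example with illustrative numbers: 120k tokens/s observed,
    # 8k tokens/s per decode replica, 0.5 prefill replicas per decode.
    print(plan_scale(120_000, 8_000, 0.5))  # PoolSizes(prefill=8, decode=15)
```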
Similar Papers
TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster.
Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
Distributed, Parallel, and Cluster Computing
Makes AI run faster on different computers.