Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
By: Rongzhi Li, Ruogu Du, Zefang Chu, and more
Potential Business Impact:
Makes serving AI models faster and cheaper.
Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven effective, increasing average GPU utilization by 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.
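To make the single-metric idea concrete, below is a minimal Python sketch of what a joint P/D scaling rule could look like. The abstract does not name the metric or the pool ratio, so the decode-throughput signal, the `target_tps_per_decode` threshold, and the `prefill_per_decode` ratio are illustrative assumptions, not HeteroScale's actual policy; the point is that deriving both pool sizes from one signal keeps the stages balanced by construction, whereas scaling them from independent metrics lets them drift apart.

```python
# Minimal sketch of a metric-driven joint P/D autoscaling policy.
# Assumptions (not from the paper): a decode-throughput signal drives
# scaling, and a fixed prefill:decode replica ratio preserves balance.

from dataclasses import dataclass
from math import ceil


@dataclass
class PoolSizes:
    prefill: int
    decode: int


def plan_scale(
    decode_tokens_per_s: float,    # observed aggregate decode throughput
    target_tps_per_decode: float,  # desired load per decode replica (assumed SLO-derived)
    prefill_per_decode: float,     # architectural P:D ratio to preserve (assumed)
    min_decode: int = 1,
) -> PoolSizes:
    """Derive both pool sizes from one signal so the P/D stages stay balanced."""
    decode = max(min_decode, ceil(decode_tokens_per_s / target_tps_per_decode))
    prefill = max(1, ceil(decode * prefill_per_decode))
    return PoolSizes(prefill=prefill, decode=decode)


if __name__ == "__main__":
    # Example with illustrative numbers: 120k tokens/s observed,
    # 8k tokens/s per decode replica, 0.5 prefill replicas per decode.
    print(plan_scale(120_000, 8_000, 0.5))  # PoolSizes(prefill=8, decode=15)
```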
Similar Papers
TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster.
Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs
Distributed, Parallel, and Cluster Computing
Makes AI run faster on different computers.