From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
By: Xingqi Cui, Chieh-Jan Mike Liang, Jiarong Xing, and others
Potential Business Impact:
Makes AI models run faster and cheaper.
Serving large generative models such as LLMs and multimodal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) against provider goals of efficiency and cost reduction. Existing solutions rely on static provisioning or model-level autoscaling, both of which treat the model as a monolith. This coarse-grained resource management leads to degraded performance or significant resource underutilization because it adapts poorly to the dynamic inference traffic that is common online. The root cause of this inefficiency lies in the internal structure of generative models: they execute as graphs of interconnected operators. Through detailed characterization and systematic analysis, we find that operators are heterogeneous in their compute and memory footprints and exhibit diverse sensitivity to workload and resource factors such as batch size, sequence length, and traffic rate. This heterogeneity suggests that the operator, rather than the entire model, is the right granularity for scaling decisions. We propose an operator-level autoscaling framework, which allocates resources at the finer granularity of individual operators, optimizing scaling, batching, and placement based on each operator's profile. Evaluated on production-scale traces, our approach preserves SLOs with up to 40% fewer GPUs and 35% less energy, or, under fixed resources, achieves 1.6x higher throughput with 5% less energy. These results show that the operator, rather than the model, is fundamentally a more effective unit for scaling large generative workloads.
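The intuition behind the abstract's claim can be illustrated with a toy calculation. The sketch below is not the paper's algorithm; it is a minimal model, with made-up operator names and capacity numbers, of why scaling per operator can use fewer replicas than scaling the whole model: model-level autoscaling replicates every operator at the rate dictated by the slowest one, while operator-level autoscaling sizes each operator against its own profiled capacity.

```python
# Toy sketch (hypothetical operators and throughput figures): comparing
# model-level vs operator-level replica counts for the same offered load.
import math

# Assumed per-operator profiles: requests/sec one replica can sustain.
# Heterogeneity in compute/memory footprint yields very different capacities.
operator_capacity = {
    "embedding": 4000.0,
    "attention": 800.0,   # the bottleneck operator
    "mlp": 1200.0,
    "lm_head": 2500.0,
}

def model_level_replicas(rate: float) -> int:
    """Model-level autoscaling: the model is a monolith, so every operator
    is replicated at the count required by the slowest (bottleneck) one."""
    bottleneck = min(operator_capacity.values())
    return len(operator_capacity) * math.ceil(rate / bottleneck)

def operator_level_replicas(rate: float) -> int:
    """Operator-level autoscaling: each operator scales independently
    against its own profiled capacity."""
    return sum(math.ceil(rate / cap) for cap in operator_capacity.values())

rate = 3000.0  # offered load, requests/sec
print(model_level_replicas(rate))     # 16 operator replicas
print(operator_level_replicas(rate))  # 10 operator replicas
```

Under these assumed numbers, model-level scaling provisions 16 operator replicas while operator-level scaling provisions 10, mirroring the paper's finding that finer-grained decisions avoid over-provisioning the non-bottleneck operators.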
Similar Papers
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Distributed, Parallel, and Cluster Computing
Makes AI models run faster and cheaper.
TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster.
ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads
Machine Learning (CS)
Predicts AI program speed accurately, saving time and money.