Adaptive Orchestration for Large-Scale Inference on Heterogeneous Accelerator Systems Balancing Cost, Performance, and Resilience
By: Yahav Biran, Imry Kissos
Potential Business Impact:
Saves money on running AI by picking the best hardware for each job.
The surge in generative AI workloads has created a need for scalable inference systems that can flexibly harness both GPUs and specialized accelerators while containing operational costs. This paper proposes a hardware-agnostic control loop that adaptively allocates requests across heterogeneous accelerators based on real-time cost and capacity signals. The approach sustains low latency and high throughput by dynamically shifting between cost-optimized and capacity-optimized modes, ensuring the most efficient use of expensive compute resources under fluctuating availability. Evaluated using the Stable Diffusion model, the framework consistently meets latency targets, automatically redirects traffic during capacity shortfalls, and capitalizes on lower-cost accelerators when possible. These results highlight how a feedback-driven deployment strategy, spanning the entire software and hardware stack, can help organizations efficiently scale generative AI workloads while maintaining resilience in the face of limited accelerator capacity.
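To make the feedback-driven idea concrete, below is a minimal sketch (not the authors' implementation) of a control loop that routes requests across accelerator pools and flips between cost-optimized and capacity-optimized modes based on a real-time latency signal. The names (AcceleratorPool, choose_pool, control_loop), the pool types, and all cost, capacity, and SLO numbers are illustrative assumptions, not values from the paper.

import random
import time
from dataclasses import dataclass


@dataclass
class AcceleratorPool:
    """Hypothetical pool of one accelerator type (e.g. a GPU family or a custom ASIC)."""
    name: str
    cost_per_request: float   # illustrative dollars per inference request
    capacity: int             # max in-flight requests the pool can absorb
    in_flight: int = 0

    @property
    def headroom(self) -> int:
        return self.capacity - self.in_flight


def choose_pool(pools, mode):
    """Pick a pool under the current mode:
    cost mode      -> cheapest pool that still has headroom
    capacity mode  -> pool with the most headroom, ignoring price
    """
    available = [p for p in pools if p.headroom > 0]
    if not available:
        return None
    if mode == "cost":
        return min(available, key=lambda p: p.cost_per_request)
    return max(available, key=lambda p: p.headroom)


def control_loop(pools, latency_slo_ms, observe_latency, ticks=20):
    """Feedback loop: switch modes from real-time latency and capacity signals."""
    mode = "cost"
    for _ in range(ticks):
        p95_latency = observe_latency()                  # real-time latency signal
        total_headroom = sum(p.headroom for p in pools)
        # Shift to capacity-optimized when the SLO is at risk or headroom is thin;
        # fall back to cost-optimized once the pressure subsides.
        if p95_latency > latency_slo_ms or total_headroom < 2:
            mode = "capacity"
        elif p95_latency < 0.8 * latency_slo_ms:
            mode = "cost"
        pool = choose_pool(pools, mode)
        if pool is not None:
            pool.in_flight += 1                          # dispatch one request
        print(f"mode={mode:8s} p95={p95_latency:6.1f}ms -> "
              f"{pool.name if pool else 'queued (no capacity)'}")
        time.sleep(0.01)
        for p in pools:                                  # simulate request completions
            p.in_flight = max(0, p.in_flight - random.randint(0, 1))


if __name__ == "__main__":
    pools = [
        AcceleratorPool("gpu-pool",         cost_per_request=0.004, capacity=4),
        AcceleratorPool("accelerator-pool", cost_per_request=0.002, capacity=3),
    ]
    # Noisy latency signal around an assumed 900 ms image-generation SLO.
    control_loop(pools, latency_slo_ms=900,
                 observe_latency=lambda: random.gauss(850, 120))

Run as a standalone script, this prints which pool each request would be routed to and shows the mode flipping when the observed latency drifts above the target, which mirrors the paper's described behavior of redirecting traffic during capacity shortfalls and returning to cheaper accelerators when possible.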
Similar Papers
Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling
Distributed, Parallel, and Cluster Computing
Makes AI services faster while using less power.
AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
Hardware Architecture
Finds the best computer chips for AI tasks.
GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters
Distributed, Parallel, and Cluster Computing
Helps clusters use both older and newer hardware well.