Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling
By: Yihong Jin, Ze Yang
Potential Business Impact:
Makes AI services faster and more power-efficient.
The rapid expansion of AI inference services in the cloud necessitates a robust scalability solution to manage dynamic workloads and maintain high performance. This study proposes a comprehensive scalability optimization framework for cloud AI inference services, focusing on real-time load balancing and autoscaling strategies. The proposed model is a hybrid approach that combines reinforcement learning for adaptive load distribution with deep neural networks for accurate demand forecasting. This multi-layered approach enables the system to anticipate workload fluctuations and proactively adjust resources, maximizing resource utilization and minimizing latency. Furthermore, a decentralized decision-making process within the model enhances fault tolerance and reduces response time in scaling operations. Experimental results demonstrate that the proposed model improves load balancing efficiency by 35% and reduces response delay by 28%, exhibiting a substantial optimization effect in comparison with conventional scalability solutions.
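The abstract does not include implementation details, but the architecture it describes (a learned load distributor paired with a demand forecaster that drives proactive scaling) can be sketched in miniature. The sketch below is illustrative only and is not the authors' code: it substitutes an epsilon-greedy bandit for the paper's reinforcement-learning distributor and an exponential moving average for the deep-network forecaster; all class and function names are hypothetical.

```python
import math
import random


class DemandForecaster:
    """Toy stand-in for the paper's deep-network demand forecaster:
    predicts the next request rate as an exponential moving average."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # weight given to the newest observation
        self.estimate = 0.0     # current predicted request rate

    def update(self, observed_rate):
        # Blend the new observation into the running prediction.
        self.estimate = self.alpha * observed_rate + (1 - self.alpha) * self.estimate
        return self.estimate


class EpsilonGreedyBalancer:
    """Toy stand-in for the reinforcement-learning load distributor:
    an epsilon-greedy bandit that learns per-replica average latency
    and routes most traffic to the fastest replica."""

    def __init__(self, n_replicas, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_replicas
        self.avg_latency = [0.0] * n_replicas

    def pick(self):
        # Explore a random replica with probability epsilon,
        # otherwise exploit the replica with lowest observed latency.
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return min(range(len(self.counts)), key=lambda i: self.avg_latency[i])

    def feedback(self, replica, latency):
        # Incremental mean update of the chosen replica's latency.
        self.counts[replica] += 1
        n = self.counts[replica]
        self.avg_latency[replica] += (latency - self.avg_latency[replica]) / n


def target_replicas(predicted_rate, per_replica_capacity, min_replicas=1):
    """Proactive autoscaling rule: provision enough replicas for the
    *predicted* demand instead of reacting to the current load."""
    return max(min_replicas, math.ceil(predicted_rate / per_replica_capacity))
```

In this sketch the forecaster's output feeds `target_replicas` to scale capacity ahead of demand, while the bandit independently steers each request, mirroring the decoupled (decentralized) decision-making the abstract describes.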
Similar Papers
Multi-dimensional Autoscaling of Processing Services: A Comparison of Agent-based Methods
Artificial Intelligence
Makes computers work better with less power.
Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems
Information Retrieval
Makes online recommendations faster and better.
Intelligent Resource Allocation Optimization for Cloud Computing via Machine Learning
Distributed, Parallel, and Cluster Computing
Makes computer clouds work smarter and cheaper.