Streamlining Resilient Kubernetes Autoscaling with Multi-Agent Systems via an Automated Online Design Framework

Published: May 26, 2025 | arXiv ID: 2505.21559v1

By: Julien Soulé, Jean-Paul Jamont, Michel Occello and more

BigTech Affiliations: Thales

Potential Business Impact:

Keeps computer systems running even when attacked.

Business Areas:

PaaS Software

In cloud-native systems, Kubernetes clusters with interdependent services often see their operational resilience undermined by poor workload management, which shows up as resource blocking, bottlenecks, or continuous pod crashes. These vulnerabilities are amplified in adversarial scenarios such as Distributed Denial-of-Service (DDoS) attacks. Conventional Horizontal Pod Autoscaling (HPA) approaches struggle under such dynamic conditions, while reinforcement learning-based methods, though more adaptable, typically optimize a single goal such as latency or resource usage and neglect broader failure scenarios. We propose decomposing the overarching goal of maintaining operational resilience into failure-specific sub-goals delegated to collaborative agents, which together form an HPA Multi-Agent System (MAS). We introduce an automated, four-phase online framework for HPA MAS design: 1) modeling a digital twin built from cluster traces; 2) training agents in simulation using roles and missions tailored to failure contexts; 3) analyzing agent behaviors for explainability; and 4) transferring the learned policies to the real cluster. Experimental results show that the generated HPA MASs outperform three state-of-the-art HPA systems at sustaining operational resilience under various adversarial conditions in a proposed complex cluster.
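
For context on the conventional HPA baseline mentioned above, the Kubernetes documentation describes the core scaling rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is close enough to 1. The sketch below illustrates only that documented rule, not the paper's multi-agent method; the function name, parameter names, and the replica bounds are assumptions chosen for illustration, and the 0.1 tolerance mirrors the documented default.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # illustrative bound, not from the source
    max_replicas: int = 10,   # illustrative bound, not from the source
    tolerance: float = 0.1,   # mirrors the documented default tolerance
) -> int:
    """Replica count suggested by the basic HPA ratio rule (hypothetical helper)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size to avoid flapping.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90.0, 60.0))
```

Because the rule reacts to one metric ratio at a time, a sudden traffic surge such as a DDoS simply pushes the result to the replica ceiling, which is the kind of single-signal behavior the abstract contrasts with failure-specific agents.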

Resilient Auto-Scaling of Microservice Architectures with Efficient Resource Management

Distributed, Parallel, and Cluster Computing

Keeps apps running smoothly during computer problems.

6 Jun 2025 1

89%

Collaborative Multi-Agent Reinforcement Learning Approach for Elastic Cloud Resource Scaling

Distributed, Parallel, and Cluster Computing

Makes cloud computers adjust power automatically.

1 Jul 2025 0

89%

AutoMaAS: Self-Evolving Multi-Agent Architecture Search for Large Language Models

Artificial Intelligence

Builds smarter AI teams that work better and cheaper.

3 Oct 2025 1

Country of Origin

🇫🇷 France

Page Count

11 pages
