Streamlining Resilient Kubernetes Autoscaling with Multi-Agent Systems via an Automated Online Design Framework

📅 2025-05-26
🤖 AI Summary
Problem: Kubernetes clusters lose operational resilience under adversarial scenarios such as DDoS attacks, because of resource congestion, bottlenecks, and persistent Pod crashes. Method: This paper proposes HPA MAS, a multi-agent horizontal pod autoscaling framework. Its core design decomposes operational resilience into failure-specific sub-goals that are delegated to collaborating agents. The design framework has four phases: digital twin modeling from cluster traces, role- and mission-driven multi-agent reinforcement learning in simulation, analysis of agent behavior for explainability, and transfer of the learned policies from simulation to the real cluster. Together these form an explainable, online adaptive architecture. Contribution/Results: Experiments show that the generated HPA MASs outperform three state-of-the-art HPA approaches across diverse adversarial workloads. They maintain service availability and improve system-level operational resilience while providing interpretable, online adaptation.

📝 Abstract
In cloud-native systems, Kubernetes clusters with interdependent services often face challenges to their operational resilience due to workload management issues such as resource blocking, bottlenecks, or continuous pod crashes. These vulnerabilities are further amplified in adversarial scenarios, such as Distributed Denial-of-Service (DDoS) attacks. Conventional Horizontal Pod Autoscaling (HPA) approaches struggle to address such dynamic conditions, while reinforcement learning-based methods, though more adaptable, typically optimize single goals like latency or resource usage, neglecting broader failure scenarios. We propose decomposing the overarching goal of maintaining operational resilience into failure-specific sub-goals delegated to collaborative agents, collectively forming an HPA Multi-Agent System (MAS). We introduce an automated, four-phase online framework for HPA MAS design: 1) modeling a digital twin built from cluster traces; 2) training agents in simulation using roles and missions tailored to failure contexts; 3) analyzing agent behaviors for explainability; and 4) transferring learned policies to the real cluster. Experimental results demonstrate that the generated HPA MASs outperform three state-of-the-art HPA systems in sustaining operational resilience under various adversarial conditions in a proposed complex cluster.
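
For context on the "conventional HPA" baseline mentioned in the abstract, the following is a minimal sketch of the replica calculation documented for the stock Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The function name, parameter names, and the min/max defaults are illustrative assumptions, not part of the paper or of any Kubernetes client library.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the single-metric HPA scaling rule (hypothetical helper).

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to [min_replicas, max_replicas]. If the ratio is within
    `tolerance` of 1.0, the replica count is left unchanged.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Within tolerance: skip scaling to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Moderate overload: average CPU at 90% against a 60% target.
    print(desired_replicas(4, 90, 60))
    # Extreme spike: the rule simply saturates at max_replicas.
    print(desired_replicas(4, 600, 60))
```

Because this rule tracks one metric against one target, a large enough spike only drives the deployment to its replica ceiling. That single-objective behavior is the kind of limitation the abstract points to when it motivates failure-specific sub-goals.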
Problem

Research questions and friction points this paper is trying to address.

Improving Kubernetes resilience against workload failures and attacks
Overcoming limitations of traditional autoscaling in dynamic conditions
Designing multi-agent systems for robust pod autoscaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent System for resilient Kubernetes autoscaling
Automated online framework with four design phases
Digital twin modeling from cluster traces for training
Authors

Julien Soulé
Thales Land and Air Systems, BU IAS, Univ. Grenoble Alpes, Grenoble INP, LCIS, 26000, Valence, France
Jean-Paul Jamont
Univ. Grenoble Alpes - IUT de Valence
Multiagent systems, Collective cyber-physical systems, Self-organization, Cooperative embedded systems, IoT/WoT
M. Occello
Univ. Grenoble Alpes, Grenoble INP, LCIS, 26000, Valence, France
Louis-Marie Traonouez
INRIA Rennes
Paul Théron
AICA IWG, La Guillermie, France