RetryGuard: Preventing Self-Inflicted Retry Storms in Cloud Microservices Applications

📅 2025-11-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Cloud-native microservices—due to their heterogeneity and automated scaling—often trigger cross-service retry storms, causing resource contention, severe latency spikes, and self-inflicted “Denial of Wallet” (DoW) failures. To address this, we propose RetryGuard, a distributed retry governance framework that introduces the first analytical-model-based, cross-service retry policy coordination mechanism. It enables parallel, service-level modeling of retry behavior and real-time, adaptive decision-making, dynamically balancing throughput, latency, and cost. RetryGuard integrates natively with Kubernetes and Istio service mesh, supporting scalable online policy optimization. Experimental evaluation demonstrates that, compared to AWS’s standard and advanced retry policies, RetryGuard reduces resource consumption by up to 42%, lowers operational costs by 37%, and maintains high stability and linear scalability under complex service topologies.

Technology Category

Application Category

📝 Abstract
Modern cloud applications are built on independent, diverse microservices, offering scalability, flexibility, and usage-based billing. However, the structural design of these varied services, along with their reliance on auto-scalers for dynamic internet traffic, introduces significant coordination challenges. As we demonstrate in this paper, common default retry patterns used between misaligned services can turn into retry storms which drive up resource usage and costs, leading to self-inflicted Denial-of-Wallet (DoW) scenarios. To overcome these problems we introduce RetryGuard, a distributed framework for productive control of retry patterns across interdependent microservices. By managing retry policy on a per-service basis and making parallel decisions, RetryGuard prevents retry storms, curbs resource contention, and mitigates escalating operational costs. RetryGuard makes its decisions based on an analytic model that captures the relationships among retries, throughput (rejections), delays, and costs. Experimental results show that RetryGuard significantly reduces resource usage and costs compared to AWS standard and advanced retry policies. We further demonstrate its scalability and superior performance in a more complex Kubernetes deployment with the Istio service mesh, where it achieves substantial improvements.
Problem

Research questions and friction points this paper is trying to address.

Preventing retry storms in cloud microservices applications
Managing resource contention and escalating operational costs
Controlling retry patterns across interdependent microservices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed framework controls retry patterns across microservices
Per-service retry policy management prevents retry storms
Analytic model optimizes retry decisions to reduce costs
🔎 Similar Papers
No similar papers found.
J
Jhonatan Tavori
Blavatnik School of Computer Science and AI, Tel Aviv University
Anat Bremler-Barr
Anat Bremler-Barr
Professor, Tel-Aviv University
Computer NetworksNetwork SecurityIoT SecurityDNS SecurityDDoS
H
Hanoch Levy
Blavatnik School of Computer Science and AI, Tel Aviv University
O
Ofek Lavi
Blavatnik School of Computer Science and AI, Tel Aviv University