Resilient Microservices: A Systematic Review of Recovery Patterns, Strategies, and Evaluation Frameworks
By: Muzeeb Mohammad
Microservice-based systems underpin modern distributed computing environments but remain vulnerable to partial failures, cascading timeouts, and inconsistent recovery behavior. Although numerous resilience and recovery patterns have been proposed, existing surveys are largely descriptive and lack systematic evidence synthesis or quantitative rigor. This paper presents a PRISMA-aligned systematic literature review of empirical studies on microservice recovery strategies published between 2014 and 2025 and indexed in IEEE Xplore, the ACM Digital Library, and Scopus. From an initial corpus of 412 records, 26 high-quality studies were selected using transparent inclusion, exclusion, and quality-assessment criteria. The review identifies nine recurring resilience themes, including circuit breakers, retries with jitter and budgets, sagas with compensation, idempotency, bulkheads, adaptive backpressure, observability, and chaos validation. As a data-oriented contribution, the paper introduces a Recovery Pattern Taxonomy, a Resilience Evaluation Score checklist for standardized benchmarking, and a constraint-aware decision matrix that maps latency, consistency, and cost trade-offs to appropriate recovery mechanisms. The results consolidate fragmented resilience research into a structured, analyzable evidence base that supports reproducible evaluation and informed design of fault-tolerant, performance-aware microservice systems.