Model Discovery and Graph Simulation: A Lightweight Alternative to Chaos Engineering
By: Anatoly A. Krasnovsky, Alexander Zorkin
Potential Business Impact:
Finds application weaknesses before they cause outages.
Microservice applications are prone to cascading failures because of dense inter-service dependencies. Ensuring resilience usually demands fault-injection experiments in production-like setups. We propose model discovery, an automated CI/CD step that extracts a live dependency graph from trace data, and show that this lightweight representation is sufficient for accurate resilience prediction. Using the DeathStarBench Social Network, we build the graph, simulate failures via Monte Carlo, and run matching chaos experiments on the real system. The graph model closely matches reality: with no replication, 16 trials yield an observed resilience of 0.186 versus a predicted 0.161; with replication, both observed and predicted values converge to 0.305 (mean absolute error ≤ 0.0004). These results indicate that even a simple, automatically discovered graph can estimate microservice availability with high fidelity, offering rapid design-time insight without full-scale failure testing.
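The two steps the abstract describes, discovering a dependency graph from traces and then simulating failures by Monte Carlo, can be illustrated with a minimal sketch. This is not the authors' implementation. The span fields (`span_id`, `parent_id`, `service`), the service names, the independent per-replica failure probability `p_fail`, and the definition of resilience used here (the fraction of trials in which the entry service and all of its transitive dependencies still have at least one live replica) are all assumptions made for illustration. The sketch also assumes the discovered graph is acyclic.

```python
import random

# Hypothetical trace spans; field and service names are illustrative only.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "compose"},
    {"span_id": "c", "parent_id": "b", "service": "storage"},
    {"span_id": "d", "parent_id": "a", "service": "timeline"},
    {"span_id": "e", "parent_id": "d", "service": "storage"},
]


def discover_graph(spans):
    """Build caller -> set of callees from parent/child span relationships."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    graph = {s["service"]: set() for s in spans}
    for s in spans:
        parent = service_of.get(s["parent_id"])
        # Skip root spans and calls that stay inside one service.
        if parent is not None and parent != s["service"]:
            graph[parent].add(s["service"])
    return graph


def request_succeeds(service, alive, graph):
    """True if the service and all of its dependencies have a live replica.

    Assumes the graph is acyclic; a cycle would recurse without end.
    """
    if alive[service] == 0:
        return False
    return all(request_succeeds(dep, alive, graph) for dep in graph[service])


def estimate_resilience(graph, entry, replicas, p_fail, trials, seed=0):
    """Monte Carlo estimate of the fraction of trials the entry point survives."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # Each replica of each service fails independently with p_fail.
        alive = {
            svc: sum(rng.random() >= p_fail for _ in range(replicas))
            for svc in graph
        }
        if request_succeeds(entry, alive, graph):
            successes += 1
    return successes / trials


graph = discover_graph(SPANS)
single = estimate_resilience(graph, "gateway", replicas=1, p_fail=0.3, trials=5000)
replicated = estimate_resilience(graph, "gateway", replicas=2, p_fail=0.3, trials=5000)
print(f"no replication: {single:.3f}, with replication: {replicated:.3f}")
```

In a setup like the one the abstract describes, an estimate of this kind would be compared against the resilience observed in matching chaos experiments. The toy graph and failure probability above are not taken from the paper, so the printed values are not expected to reproduce its reported numbers.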
Similar Papers
Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo
Software Engineering
Estimates how likely online services are to fail.
SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems
Machine Learning (CS)
Predicts computer network slowdowns accurately.
HGraphScale: Hierarchical Graph Learning for Autoscaling Microservice Applications in Container-based Cloud Computing
Distributed, Parallel, and Cluster Computing
Makes apps run faster by smartly using computer power.