Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo

Published: December 13, 2025 | arXiv ID: 2512.12314v1

By: Anatoly A. Krasnovsky

While distributed tracing and chaos engineering are becoming standard for microservices, resilience models remain largely manual and bespoke. We revisit a trace-discovered connectivity model that derives a service dependency graph from traces and uses Monte Carlo simulation to estimate endpoint availability under fail-stop service failures. Compared to earlier work, we (i) derive the graph directly from raw OpenTelemetry traces, (ii) attach endpoint-specific success predicates, and (iii) add a simple asynchronous semantics that treats Kafka edges as non-blocking for immediate HTTP success. We apply this model to the OpenTelemetry Demo ("Astronomy Shop") using a GitHub Actions workflow that discovers the graph, runs simulations, and executes chaos experiments that randomly kill microservices in a Docker Compose deployment. Across the studied failure fractions, the model reproduces the overall availability degradation curve, while asynchronous semantics for Kafka edges change predicted availabilities by at most about 10^(-5) (0.001 percentage points). This null result suggests that for immediate HTTP availability in this case study, explicitly modeling asynchronous dependencies is not warranted, and a simpler connectivity-only model is sufficient.
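The simulation loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy graph, the edge label `"kafka"`, the function names `required_services` and `estimate_availability`, and the success predicate (every service reachable over blocking edges from the entry point must be alive) are all assumptions made for illustration. Each trial kills a fixed fraction of services at random (fail-stop) and checks whether the endpoint would still succeed, with or without treating Kafka edges as non-blocking.

```python
import random

# Hypothetical toy dependency graph: caller -> list of (callee, edge kind).
# Every service appears as a key so that it can be selected for failure.
GRAPH = {
    "frontend": [("checkout", "sync")],
    "checkout": [("payment", "sync"), ("accounting", "kafka")],
    "payment": [],
    "accounting": [],
}


def required_services(graph, entry, async_nonblocking):
    """Return the services that must be alive for an immediate success at `entry`.

    Assumed predicate: all services reachable from `entry` are required.
    When `async_nonblocking` is True, "kafka" edges are not followed.
    """
    required, stack = set(), [entry]
    while stack:
        svc = stack.pop()
        if svc in required:
            continue
        required.add(svc)
        for callee, kind in graph.get(svc, []):
            if async_nonblocking and kind == "kafka":
                continue  # asynchronous edge does not block the HTTP response
            stack.append(callee)
    return required


def estimate_availability(graph, entry, fail_fraction, async_nonblocking,
                          trials=10_000, seed=0):
    """Monte Carlo estimate of endpoint availability under fail-stop failures."""
    rng = random.Random(seed)
    services = sorted(graph)
    n_failed = round(fail_fraction * len(services))
    required = required_services(graph, entry, async_nonblocking)
    successes = 0
    for _ in range(trials):
        failed = set(rng.sample(services, n_failed))
        if not (required & failed):  # no required service was killed
            successes += 1
    return successes / trials


if __name__ == "__main__":
    for blocking in (False, True):
        est = estimate_availability(GRAPH, "frontend", 0.25, async_nonblocking=blocking)
        print(f"async_nonblocking={blocking}: availability ~ {est:.3f}")
```

In this toy graph the two semantics differ sharply because the only Kafka consumer is otherwise unreachable. The paper's null result corresponds to the situation where the two estimates nearly coincide for the endpoints studied.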

Model Discovery and Graph Simulation: A Lightweight Alternative to Chaos Engineering

Software Engineering

Finds app problems before they break.

12 Jun 2025 1

Lessons from a Big-Bang Integration: Challenges in Edge Computing and Machine Learning

Software Engineering

Fixes teamwork problems for big tech projects.

23 Jul 2025 0

Trace Sampling 2.0: Code Knowledge Enhanced Span-level Sampling for Distributed Tracing

Software Engineering

Keeps all computer logs, saves lots of space.

17 Sep 2025 0
