Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications
By: Shengkun Cui, Rahul Krishna, Saurabh Jha and more
Potential Business Impact:
Fixes cloud problems faster by checking code and connections.
Cloud incidents pose major operational challenges in production: unresolved production cloud incidents cost on average over $2M per hour. Prior research identifies code- and configuration-related issues as the predominant category of root causes in cloud incidents. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graphs: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Together, these graphs encode microservice- and code-level dependencies, and the LLM acts as a traversal policy over them, moving between services and code dependencies to localize and explain failures. Compared to state-of-the-art ReAct baselines, PRAXIS improves root cause analysis (RCA) accuracy by up to 3.1x while reducing token consumption by 3.8x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.
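The two-level traversal described in the abstract can be illustrated with a small sketch. This is not the PRAXIS implementation, which the abstract does not detail. It assumes both graphs are plain adjacency maps (a real hammock-block PDG is richer), and it replaces the LLM with a deterministic stub policy. All service names, block names, the ANOMALOUS set, and the functions traverse and localize are hypothetical.

```python
from typing import Callable, Dict, List, Set

Graph = Dict[str, List[str]]

# Hypothetical service dependency graph (SDG): caller -> callees.
SDG: Graph = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payment"],
    "catalog": [],
    "payment": [],
}

# Hypothetical code-level graphs, one per service: block -> blocks it depends on.
# A real hammock-block PDG carries more structure than this adjacency map.
PDGS: Dict[str, Graph] = {
    "payment": {
        "handle_charge": ["validate_card", "load_config"],
        "validate_card": [],
        "load_config": [],
    },
}

# Assumed stand-in for observability signals (alerts, error logs, traces).
ANOMALOUS: Set[str] = {
    "frontend", "checkout", "payment", "handle_charge", "load_config",
}

# A policy receives the current node and its neighbors, and returns
# the neighbors worth visiting next. In an agentic system this would
# be an LLM call; here it is a plain function.
Policy = Callable[[str, List[str]], List[str]]


def stub_policy(current: str, candidates: List[str]) -> List[str]:
    """Stand-in for the LLM: follow only neighbors flagged as anomalous."""
    return [c for c in candidates if c in ANOMALOUS]


def traverse(graph: Graph, start: str, policy: Policy) -> List[str]:
    """Depth-first walk in which the policy picks the edges to follow.

    Returns the nodes where the policy selected no further neighbor,
    i.e. the deepest suspects reachable from `start`.
    """
    suspects: List[str] = []
    visited: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        chosen = policy(node, graph.get(node, []))
        if chosen:
            stack.extend(reversed(chosen))  # keep left-to-right visit order
        else:
            suspects.append(node)
    return suspects


def localize(entry_service: str, entry_blocks: Dict[str, str],
             policy: Policy) -> Dict[str, List[str]]:
    """Narrow to suspect services on the SDG, then to code blocks on each PDG."""
    findings: Dict[str, List[str]] = {}
    for service in traverse(SDG, entry_service, policy):
        start = entry_blocks.get(service)
        pdg = PDGS.get(service, {})
        findings[service] = traverse(pdg, start, policy) if start else []
    return findings


result = localize("frontend", {"payment": "handle_charge"}, stub_policy)
print(result)
```

With the toy data above, the walk narrows from the frontend service to the payment service and then to the load_config block. Keeping the policy as a separate callable means the graph structure constrains which nodes can be visited, while the choice among them stays swappable.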
Similar Papers
ErrorPrism: Reconstructing Error Propagation Paths in Cloud Service Systems
Software Engineering
Finds the real reason computer problems happen.
A Decentralized Root Cause Localization Approach for Edge Computing Environments
Distributed, Parallel, and Cluster Computing
Finds the real problem in smart devices faster.
Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM
Distributed, Parallel, and Cluster Computing
Finds computer problems faster and more accurately.