Research on fault diagnosis and root cause analysis based on full stack observability
By: Jian Hou
Potential Business Impact:
Finds computer problems faster and explains why.
With the rapid growth of cloud computing and ultra-large-scale data centers, systems have become significantly larger and more complex, and faults occur frequently and often propagate in cascades. Achieving efficient, accurate, and interpretable Root Cause Analysis (RCA) from observability data (metrics, logs, and traces) has become a core problem in AIOps. This paper reviews two mainstream research threads from top conferences and journals over the past five years: dynamic causal discovery, as in FaultInsight[1], and multi-modal/cross-level fusion, as in HolisticRCA[2]. It also analyzes the strengths and weaknesses of existing methods. The paper then proposes KylinRCA, a framework that integrates the ideas of both threads: it traces the fault propagation chain through temporal causal discovery, performs global root cause localization and fault type identification through cross-modal graph learning, and outputs auditable evidence chains using mask-based explanation methods. Finally, it designs a multi-dimensional experimental scheme, specifies the evaluation metrics, and discusses engineering challenges, to provide an effective solution for fault diagnosis under full-stack observability.
Similar Papers
DynaCausal: Dynamic Causality-Aware Root Cause Analysis for Distributed Microservices
Software Engineering
Finds the real cause of computer problems faster.
Efficient Fault Localization in a Cloud Stack Using End-to-End Application Service Topology
Performance
Finds computer problems faster so they can be fixed.
A Goal-Driven Survey on Root Cause Analysis
Software Engineering
Helps fix computer problems faster by understanding their causes.