Research on fault diagnosis and root cause analysis based on full stack observability

Published: September 8, 2025 | arXiv ID: 2509.12231v1

By: Jian Hou

Potential Business Impact:

Finds computer problems faster and explains why.

Business Areas:
Big Data, Data and Analytics

With the rapid growth of cloud computing and ultra-large-scale data centers, systems have become significantly larger and more complex, and faults occur frequently and often propagate in cascades. Achieving efficient, accurate, and interpretable Root Cause Analysis (RCA) from observability data (metrics, logs, and traces) has become a core problem in AIOps. This paper reviews two mainstream research threads from top conferences and journals over the past five years: FaultInsight[1], which focuses on dynamic causal discovery, and HolisticRCA[2], which focuses on multi-modal/cross-level fusion. It analyzes the advantages and disadvantages of existing methods and proposes KylinRCA, a framework that integrates the ideas of both. KylinRCA depicts the fault propagation chain through temporal causal discovery, performs global root cause localization and fault type identification through cross-modal graph learning, and outputs auditable evidence chains by combining these with mask-based explanation methods. The paper also designs a multi-dimensional experimental scheme, clarifies the evaluation metrics, and discusses engineering challenges, with the aim of providing an effective solution for fault diagnosis under full-stack observability.
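To make the idea of tracing a propagation chain with temporal causal discovery more concrete, here is a minimal sketch. It is not the KylinRCA algorithm, which the abstract does not specify. It assumes a much simpler stand-in technique: lagged Pearson correlation between per-service metric series, with an edge drawn from the service whose anomaly appears earlier to the one that echoes it later. The service names, the sample data, and the helper functions `lagged_corr`, `discover_edges`, and `root_candidates` are all hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def lagged_corr(cause, effect, lag):
    """Correlation between cause[t] and effect[t + lag]."""
    return pearson(cause[: len(cause) - lag], effect[lag:])


def discover_edges(series, max_lag=3, threshold=0.9):
    """Add src -> dst when src leads dst strongly, and more strongly than the reverse.

    Assumption: lead/lag correlation stands in for a real causal discovery
    method. Indirect (transitive) edges are not pruned.
    """
    edges = []
    for src in series:
        for dst in series:
            if src == dst:
                continue
            forward = max(lagged_corr(series[src], series[dst], k)
                          for k in range(1, max_lag + 1))
            backward = max(lagged_corr(series[dst], series[src], k)
                           for k in range(1, max_lag + 1))
            if forward >= threshold and forward > backward:
                edges.append((src, dst))
    return edges


def root_candidates(series, edges):
    """Nodes with no incoming edge are treated as root cause candidates."""
    has_parent = {dst for _, dst in edges}
    return [name for name in series if name not in has_parent]


# Hypothetical latency anomaly: a spike starts in service_a and reaches
# service_b one step later and service_c two steps later.
spike = [0, 0, 0, 5, 9, 4, 0, 0, 0, 0, 0, 0]
metrics = {
    "service_a": spike,
    "service_b": [0] + spike[:-1],
    "service_c": [0, 0] + spike[:-2],
}

edges = discover_edges(metrics)
print(edges)
print(root_candidates(metrics, edges))
```

In this toy setup the only node without an incoming edge is the service where the spike originates. A real system would need a proper causal discovery method, handling of confounders and noise, and the log and trace modalities that the abstract describes.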

Country of Origin
🇨🇳 China

Page Count
27 pages

Category
Computer Science:
Distributed, Parallel, and Cluster Computing