A Decentralized Root Cause Localization Approach for Edge Computing Environments
By: Duneesha Fernando, Maria A. Rodriguez, Rajkumar Buyya
Potential Business Impact:
Finds the real problem in smart devices faster.
Edge computing environments host increasingly complex microservice-based IoT applications, which are prone to performance anomalies that can propagate across dependent services. Identifying the true source of such anomalies, known as Root Cause Localization (RCL), is essential for timely mitigation. However, existing RCL approaches are designed for cloud environments and rely on centralized analysis, which increases latency and communication overhead when applied at the edge. This paper proposes a decentralized RCL approach that executes localization directly at the edge device level using the Personalized PageRank (PPR) algorithm. The proposed method first groups microservices into communication- and colocation-aware clusters, thereby confining most anomaly propagation within cluster boundaries. Within each cluster, PPR is executed locally to identify the root cause, significantly reducing localization time. For the rare cases where anomalies propagate across clusters, we introduce an inter-cluster peer-to-peer approximation process, enabling lightweight coordination among clusters with minimal communication overhead. To enhance the accuracy of localization in heterogeneous edge environments, we also propose a novel anomaly scoring mechanism tailored to the diverse anomaly triggers that arise across microservice, device, and network layers. Evaluation results on the publicly available edge dataset, MicroCERCL, demonstrate that the proposed decentralized approach achieves comparable or higher localization accuracy than its centralized counterpart while reducing localization time by up to 34%. These findings highlight that decentralized graph-based RCL can provide a practical and efficient solution for anomaly diagnosis in resource-constrained edge environments.
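The intra-cluster step described in the abstract, running Personalized PageRank locally to rank root-cause candidates, can be illustrated with a minimal sketch. This is not the authors' implementation. The service names, the anomaly scores, the caller-to-callee edge direction, and the `personalized_pagerank` helper are all assumptions made for illustration. The sketch uses anomaly scores as the teleport (personalization) vector, so the random walk favors services that look anomalous and that anomalous callers depend on.

```python
# Illustrative sketch only: a plain power-iteration Personalized PageRank
# over one hypothetical cluster of microservices. Every node that appears
# in an edge must also have an entry in anomaly_scores.

def personalized_pagerank(edges, anomaly_scores, damping=0.85,
                          iterations=200, tol=1e-10):
    """Rank services; edges are (caller, callee) pairs (assumed direction)."""
    nodes = sorted(anomaly_scores)
    total = sum(anomaly_scores.values())
    # Personalization vector: normalized anomaly scores.
    pref = {n: anomaly_scores[n] / total for n in nodes}

    out = {n: [] for n in nodes}
    for caller, callee in edges:
        out[caller].append(callee)

    rank = dict(pref)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * pref[n] for n in nodes}
        for n in nodes:
            if out[n]:
                # Spread this node's rank evenly over the services it calls.
                share = damping * rank[n] / len(out[n])
                for callee in out[n]:
                    nxt[callee] += share
            else:
                # Dangling node (calls nothing): redistribute via pref.
                for m in nodes:
                    nxt[m] += damping * rank[n] * pref[m]
        delta = sum(abs(nxt[n] - rank[n]) for n in nodes)
        rank = nxt
        if delta < tol:
            break
    return rank


# Hypothetical cluster: frontend calls cart and auth; cart calls db.
edges = [("frontend", "cart"), ("frontend", "auth"), ("cart", "db")]
# Made-up anomaly scores (higher means more anomalous).
anomaly_scores = {"frontend": 0.3, "cart": 0.6, "db": 0.9, "auth": 0.05}

ranks = personalized_pagerank(edges, anomaly_scores)
for service, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{service}: {score:.3f}")
```

In this toy cluster the walk concentrates on `db`, the most anomalous service at the end of the call chain, so it is ranked as the most likely root cause. The paper's own anomaly scoring mechanism and its inter-cluster approximation process are not modeled here.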
Similar Papers
Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought
Software Engineering
Finds computer problems faster by thinking like people.
Root Cause Analysis for Microservice Systems via Cascaded Conditional Learning with Hypergraphs
Machine Learning (CS)
Finds computer problems faster by seeing how they spread.
Efficient Fault Localization in a Cloud Stack Using End-to-End Application Service Topology
Performance
Finds computer problems faster to fix them.