Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices
By: Lingzhe Zhang, Tong Jia, Yunpeng Zhai and more
Potential Business Impact:
Finds computer problems faster by learning from past fixes.
As contemporary microservice systems become increasingly popular and complex, often comprising hundreds or even thousands of fine-grained, interdependent subsystems, they experience more frequent failures. Ensuring system reliability thus demands accurate root cause localization. Many traditional graph-based and deep learning approaches have been explored for this task, but they often rely heavily on pre-defined schemas that struggle to adapt to evolving operational contexts. Consequently, a number of LLM-based methods have recently been proposed. However, these methods still face two major limitations: shallow, symptom-centric reasoning that undermines accuracy, and a lack of cross-alert reuse that leads to redundant reasoning and high latency. In this paper, we conduct a comprehensive study of how Site Reliability Engineers (SREs) localize the root causes of failures, drawing insights from professionals across multiple organizations. Our investigation reveals that expert root cause analysis exhibits three key characteristics: recursiveness, multi-dimensional expansion, and cross-modal reasoning. Motivated by these findings, we introduce AMER-RCL, an agentic memory enhanced recursive reasoning framework for root cause localization in microservices. AMER-RCL combines two components. The Recursive Reasoning RCL engine is a multi-agent framework that performs recursive reasoning on each alert to progressively refine candidate causes. Agentic Memory incrementally accumulates and reuses reasoning from prior alerts within a time window, which reduces redundant exploration and lowers inference latency. Experimental results demonstrate that AMER-RCL consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
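The interplay between per-alert recursion and cross-alert reuse can be illustrated with a toy sketch. This is not the AMER-RCL implementation: the abstract does not describe its agents, prompts, or memory format, so everything below is an assumption made for illustration. The names `WindowedMemory` and `localize`, the dependency map, the set of unhealthy services, and the 300-second window are all hypothetical, and a simple "follow the first unhealthy dependency" rule stands in for LLM-driven reasoning.

```python
from dataclasses import dataclass, field


@dataclass
class WindowedMemory:
    """Hypothetical cache of per-service conclusions, valid for window_s seconds."""
    window_s: float
    _store: dict = field(default_factory=dict)  # service -> (timestamp, root_cause)

    def get(self, service, now):
        entry = self._store.get(service)
        if entry is not None and now - entry[0] <= self.window_s:
            return entry[1]
        return None  # missing or older than the window

    def put(self, service, now, root_cause):
        self._store[service] = (now, root_cause)


def localize(service, deps, unhealthy, memory, now, stats, seen=None):
    """Recursively follow unhealthy dependencies and return the deepest one found.

    Stand-in for recursive reasoning: each call refines the candidate cause by
    descending one level, and reuses a remembered conclusion when one exists.
    """
    seen = set() if seen is None else seen
    cached = memory.get(service, now)
    if cached is not None:
        stats["hits"] += 1  # reasoning reused from an earlier alert
        return cached
    stats["steps"] += 1  # one fresh reasoning step
    seen.add(service)
    cause = service
    for dep in deps.get(service, []):
        if dep in unhealthy and dep not in seen:
            cause = localize(dep, deps, unhealthy, memory, now, stats, seen)
            break
    memory.put(service, now, cause)
    return cause


# Illustrative topology (assumed): two alerts that share the "cart" -> "db" path.
deps = {"frontend": ["cart", "search"], "checkout": ["cart"], "cart": ["db"], "db": []}
unhealthy = {"frontend", "checkout", "cart", "db"}
memory = WindowedMemory(window_s=300.0)
stats = {"steps": 0, "hits": 0}

first = localize("frontend", deps, unhealthy, memory, now=0.0, stats=stats)
second = localize("checkout", deps, unhealthy, memory, now=60.0, stats=stats)
print(first, second, stats)
```

In this toy setup the second alert arrives inside the window, so it spends one fresh step on `checkout` and then reuses the remembered conclusion for `cart` rather than re-exploring the path to `db`. An alert arriving after the window has elapsed would redo the full descent.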