LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems
By: Zhihan Jiang, Jinyang Liu, Yichen Li and more
Potential Business Impact:
Finds computer problems faster by reading logs.
Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. However, on-call engineers are often burdened with manually inspecting massive volumes of logs to identify root causes. While various automated tools have been proposed, they struggle in practice because they scope logs without regard to the alert and cannot organize complex data effectively for reasoning. To overcome these limitations, we introduce LogPilot, an intent-aware and scalable framework powered by Large Language Models (LLMs) for automated log-based alert diagnosis. To be intent-aware, LogPilot interprets the logic in alert definitions (e.g., PromQL) to precisely identify causally related logs and requests. To achieve scalability, it reconstructs each request's execution into a spatiotemporal log chain, clusters similar chains to identify recurring execution patterns, and provides representative samples to the LLM for diagnosis. This clustering-based approach ensures the input is both rich in diagnostic detail and compact enough to fit within the LLM's context window. Evaluated on real-world alerts from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods. With a diagnosis time under one minute and a cost of only $0.074 per alert, LogPilot has been successfully deployed in production, offering an automated and practical solution for service alert diagnosis.
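The chain reconstruction and clustering steps described in the abstract can be illustrated with a minimal Python sketch. This is not LogPilot's implementation: the `LogRecord` fields, the function names, and the sample logs are all hypothetical, and chains are grouped by exact match on their ordered (service, template) steps as a simple stand-in for whatever similarity clustering the paper uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    # All fields are assumptions made for this illustration.
    request_id: str   # ID that ties a log line to one request
    timestamp: float  # when the line was emitted (temporal dimension)
    service: str      # which service emitted it (spatial dimension)
    template: str     # log message with variable parts masked out


def build_chains(records):
    """Group records by request and order each group by time."""
    by_request = defaultdict(list)
    for rec in records:
        by_request[rec.request_id].append(rec)
    return {
        rid: sorted(recs, key=lambda r: r.timestamp)
        for rid, recs in by_request.items()
    }


def chain_signature(chain):
    """Reduce a chain to its ordered (service, template) steps."""
    return tuple((r.service, r.template) for r in chain)


def cluster_chains(chains):
    """Group chains with identical signatures (assumed stand-in for clustering)."""
    clusters = defaultdict(list)
    for rid, chain in chains.items():
        clusters[chain_signature(chain)].append(rid)
    return clusters


def representatives(clusters):
    """Return (signature, example request ID, cluster size), largest first."""
    ranked = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(sig, sorted(rids)[0], len(rids)) for sig, rids in ranked]


# Made-up sample logs for three requests.
records = [
    LogRecord("r1", 1.0, "gateway", "request received"),
    LogRecord("r1", 2.0, "db", "query timeout after <*> ms"),
    LogRecord("r2", 1.5, "gateway", "request received"),
    LogRecord("r2", 2.5, "db", "query timeout after <*> ms"),
    LogRecord("r3", 3.0, "gateway", "request received"),
    LogRecord("r3", 3.2, "db", "query ok"),
]

summary = representatives(cluster_chains(build_chains(records)))
for sig, example, size in summary:
    print(size, example, " -> ".join(f"{svc}:{tpl}" for svc, tpl in sig))
```

Only one representative chain per cluster, together with the cluster size, would be passed to the LLM in this sketch, which is how the input stays compact while still showing each recurring execution pattern.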