Score: 2

LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems

Published: September 30, 2025 | arXiv ID: 2509.25874v1

By: Zhihan Jiang, Jinyang Liu, Yichen Li and more

BigTech Affiliations: ByteDance

Potential Business Impact:

Finds computer problems faster by reading logs.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. However, on-call engineers are often burdened with manually inspecting massive volumes of logs to identify root causes. While various automated tools have been proposed, they struggle in practice due to alert-agnostic log scoping and the inability to organize complex data effectively for reasoning. To overcome these limitations, we introduce LogPilot, an intent-aware and scalable framework powered by Large Language Models (LLMs) for automated log-based alert diagnosis. LogPilot introduces an intent-aware approach, interpreting the logic in alert definitions (e.g., PromQL) to precisely identify causally related logs and requests. To achieve scalability, it reconstructs each request's execution into a spatiotemporal log chain, clusters similar chains to identify recurring execution patterns, and provides representative samples to the LLMs for diagnosis. This clustering-based approach ensures the input is both rich in diagnostic detail and compact enough to fit within the LLM's context window. Evaluated on real-world alerts from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods. With a diagnosis time under one minute and a cost of only $0.074 per alert, LogPilot has been successfully deployed in production, offering an automated and practical solution for service alert diagnosis.
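
The scalability step described in the abstract (rebuild per-request log chains, cluster similar chains, pass representatives to the LLM) can be illustrated with a minimal sketch. This is not LogPilot's implementation: the `LogEvent` fields, the function names, and the exact-match clustering rule are all assumptions made for illustration, and the paper's own notion of chain similarity may well differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    # All fields are hypothetical; real log schemas will differ.
    request_id: str   # ties a log line to one request
    timestamp: float  # seconds; gives the temporal order
    service: str      # where the line was emitted (the "spatial" part)
    template: str     # log message with variable parts abstracted away


def build_chains(events):
    """Group events by request and order each group by time."""
    by_request = defaultdict(list)
    for event in events:
        by_request[event.request_id].append(event)
    return {
        rid: sorted(evts, key=lambda e: e.timestamp)
        for rid, evts in by_request.items()
    }


def chain_signature(chain):
    """Reduce a chain to its ordered (service, template) steps."""
    return tuple((e.service, e.template) for e in chain)


def cluster_chains(chains):
    """Cluster chains with identical signatures (a deliberate simplification)."""
    clusters = defaultdict(list)
    for rid, chain in chains.items():
        clusters[chain_signature(chain)].append(rid)
    return clusters


def representatives(clusters):
    """Return (request_id, cluster_size) per cluster, largest clusters first."""
    ranked = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[1][0]))
    return [(rids[0], len(rids)) for _, rids in ranked]


# Toy data: two requests follow the same path, one hits a timeout.
events = [
    LogEvent("r1", 1.0, "gateway", "request received"),
    LogEvent("r1", 2.0, "db", "query ok"),
    LogEvent("r2", 1.5, "gateway", "request received"),
    LogEvent("r2", 2.5, "db", "query ok"),
    LogEvent("r3", 3.1, "db", "query timeout"),
    LogEvent("r3", 3.0, "gateway", "request received"),
]
chains = build_chains(events)
reps = representatives(cluster_chains(chains))
```

Here three requests collapse into two patterns, so only two sample chains (those of `r1` and `r3`) would need to reach the model, each annotated with how many requests share its pattern. That is the sense in which clustering keeps the input compact while preserving the rare, diagnostically interesting path.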

Country of Origin
🇨🇳 China

Page Count
13 pages

Category
Computer Science:
Software Engineering