UniSage: A Unified and Post-Analysis-Aware Sampling for Microservices
By: Zhouruixing Zhu, Zhihan Jiang, Tianyi Yang and more
Potential Business Impact:
Finds computer problems faster by saving important data.
Traces and logs are essential for observability and fault diagnosis in modern distributed systems. However, their ever-growing volume introduces substantial storage overhead and complicates troubleshooting. Existing approaches typically adopt a sample-before-analysis paradigm: even when guided by data heuristics, they inevitably discard failure-related information and hinder transparency in diagnosing system behavior. To address this, we introduce UniSage, the first unified framework to sample both traces and logs using a post-analysis-aware paradigm. Instead of discarding data upfront, UniSage first performs lightweight, multi-modal anomaly detection and root cause analysis (RCA) on the complete data stream. This process yields fine-grained, service-level diagnostic insights that guide a dual-pillar sampling strategy for handling both normal and anomalous scenarios: an analysis-guided sampler prioritizes data implicated by RCA, while an edge-case-based sampler ensures that rare but critical behaviors are captured. Together, these pillars ensure comprehensive coverage of critical signals without excessive redundancy. Extensive experiments demonstrate that UniSage significantly outperforms state-of-the-art baselines. At a 2.5% sampling rate, it captures 56.5% of critical traces and 96.25% of relevant logs, while improving the accuracy (AC@1) of downstream root cause analysis by 42.45%. Furthermore, its efficient pipeline processes 10 minutes of telemetry data in under 5 seconds, demonstrating its practicality for production environments.
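The dual-pillar idea can be illustrated with a small sketch. This is not the authors' implementation: the `Trace` type, the `sample_traces` function, the `pattern` key, the `guided_share` budget split, and the "rarest call path first" rule are all assumptions made for illustration. The sketch assumes RCA has already run and produced a per-service suspicion score.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    services: tuple  # services the trace passed through
    pattern: str     # hypothetical key describing the trace's call path


def sample_traces(traces, suspect_scores, rate, guided_share=0.5):
    """Keep roughly `rate` of the traces using two samplers.

    suspect_scores: assumed RCA output, mapping service name -> suspicion.
    guided_share:   assumed fraction of the budget given to pillar 1.
    """
    budget = max(1, round(len(traces) * rate))

    def suspicion(trace):
        # A trace is as suspicious as the most suspicious service it touches.
        return max((suspect_scores.get(s, 0.0) for s in trace.services),
                   default=0.0)

    # Pillar 1 (analysis-guided): prefer traces implicated by RCA.
    implicated = sorted((t for t in traces if suspicion(t) > 0),
                        key=suspicion, reverse=True)
    guided_quota = round(budget * guided_share) if implicated else 0
    kept = implicated[:guided_quota]

    # Pillar 2 (edge-case-based): spend the rest on the rarest call paths.
    kept_ids = {t.trace_id for t in kept}
    frequency = Counter(t.pattern for t in traces)
    rest = sorted((t for t in traces if t.trace_id not in kept_ids),
                  key=lambda t: frequency[t.pattern])
    kept.extend(rest[:budget - len(kept)])
    return kept
```

When `suspect_scores` is empty (the normal scenario in this sketch), the whole budget goes to the rare-pattern sampler; when RCA flags a service, part of the budget is reserved for traces that touch it. The same shape could be applied to logs by swapping the `Trace` record for a log record, which is again an assumption rather than something the abstract specifies.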