LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
By: Du Yin, Jiayi Ren, Xiayu Sun, and more
LLM inference latency critically determines user experience and operational cost, directly impacting throughput under SLO constraints. Even brief latency spikes degrade service quality despite acceptable average performance. However, distributed inference environments that combine diverse software frameworks and XPU architectures with dynamic workloads make latency analysis challenging. Constrained by intrusive designs that require service restarts or even suspension, and by hardware-bound implementations that cannot adapt to heterogeneous inference environments, existing AI profiling methods are often inadequate for real-time production analysis. We present LatencyPrism, the first zero-intrusion, multi-platform latency sculpting system. It breaks down inference latency across the pipeline, proactively alerts on latency anomalies, and guarantees adherence to SLOs, all without code modifications or service restarts. LatencyPrism has been deployed across thousands of XPUs for over six months. It enables low-overhead, real-time monitoring at the batch level, with alerts triggered within milliseconds. It distinguishes workload-driven latency variations from anomalies that indicate underlying issues with an F1-score of 0.98. We also conduct extensive experiments and root cause analysis investigations to demonstrate LatencyPrism's capability.
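To make the core idea concrete, the sketch below illustrates one simple way to separate workload-driven latency variation from genuine anomalies at batch level: regress observed batch latency against workload size over a sliding window and flag batches whose residual is far outside the expected spread. This is only a minimal illustration under assumed conventions; the class, field names, thresholds, and the linear model are hypothetical and are not taken from the LatencyPrism paper.

```python
"""Illustrative sketch (not LatencyPrism's implementation) of workload-aware
batch-level latency anomaly detection. All names and thresholds are assumptions."""
from collections import deque
from dataclasses import dataclass
import statistics


@dataclass
class BatchSample:
    batch_size: int      # number of requests in the batch
    total_tokens: int    # workload proxy: tokens processed in this batch
    latency_ms: float    # observed end-to-end batch latency


class WorkloadAwareAnomalyDetector:
    """Fits latency ~ a * total_tokens + b over a sliding window and flags
    batches whose residual exceeds k standard deviations, i.e. latency that
    is not explained by workload size alone."""

    def __init__(self, window: int = 256, k_sigma: float = 4.0, warmup: int = 32):
        self.history = deque(maxlen=window)
        self.k_sigma = k_sigma
        self.warmup = warmup

    def _fit(self):
        # Ordinary least-squares fit of latency against token count.
        xs = [s.total_tokens for s in self.history]
        ys = [s.latency_ms for s in self.history]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs) or 1e-9
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        intercept = mean_y - slope * mean_x
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        sigma = statistics.pstdev(residuals) or 1e-9
        return slope, intercept, sigma

    def observe(self, sample: BatchSample) -> bool:
        """Record a batch; return True if its latency looks anomalous."""
        anomalous = False
        if len(self.history) >= self.warmup:
            slope, intercept, sigma = self._fit()
            expected = slope * sample.total_tokens + intercept
            anomalous = abs(sample.latency_ms - expected) > self.k_sigma * sigma
        self.history.append(sample)
        return anomalous


# Example usage: a batch twice as slow as its workload predicts is flagged.
detector = WorkloadAwareAnomalyDetector()
for i in range(100):
    detector.observe(BatchSample(batch_size=8, total_tokens=1000 + i, latency_ms=50.0 + 0.01 * i))
print(detector.observe(BatchSample(batch_size=8, total_tokens=1050, latency_ms=120.0)))  # True
```

A production system such as the one described in the abstract would additionally attribute anomalies to pipeline stages and raise alerts within milliseconds; this sketch only conveys the workload-versus-anomaly separation at the batch level.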