Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
By: Yueying Li, Jim Dai, Tianyi Peng
Potential Business Impact:
Makes AI systems handle more requests quickly and reliably.
As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant effort has gone into system-level engineering, little has been explored from a mathematical modeling and queueing perspective. In this paper, we aim to develop the queueing fundamentals for LLM inference, bridging the gap between the queueing theory and LLM systems communities. In particular, we study the throughput of LLM inference systems. We prove that a large class of 'work-conserving' scheduling algorithms achieves maximum throughput for an individual LLM inference engine, highlighting 'work-conserving' as a key design principle in practice. In a network of LLM agents, however, work-conserving scheduling alone is insufficient, particularly under specific workload structures and multi-class workflows that require more sophisticated scheduling strategies. Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, reassuring practitioners, while FasterTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our results highlight the substantial benefits the queueing community can offer to LLM inference systems and call for more interdisciplinary collaboration.
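To make the 'work-conserving' principle concrete, below is a minimal Python sketch (not the authors' implementation) of a continuous-batching loop for a single inference engine: whenever requests are waiting and the per-step token budget allows, the scheduler admits work instead of idling. The names Request, WorkConservingScheduler, max_batch_tokens, and step are illustrative assumptions, not APIs from the paper or from any serving framework.

```python
# Sketch of a work-conserving continuous-batching scheduler (illustrative only).
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prompt_tokens: int   # tokens to prefill on admission
    decode_tokens: int   # tokens left to generate

class WorkConservingScheduler:
    """Never leaves capacity unused while requests are queued."""

    def __init__(self, max_batch_tokens: int = 64):
        self.max_batch_tokens = max_batch_tokens  # per-step token budget (assumed)
        self.waiting = deque()   # requests not yet admitted
        self.running = []        # requests currently decoding

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list:
        """One engine iteration: admit waiting requests while budget remains,
        then decode one token for every running request."""
        budget = self.max_batch_tokens - len(self.running)
        # Work-conserving admission: keep pulling from the queue while we can.
        while self.waiting and self.waiting[0].prompt_tokens <= budget:
            req = self.waiting.popleft()
            budget -= req.prompt_tokens   # prefill consumes its prompt tokens
            self.running.append(req)

        finished = []
        for req in list(self.running):
            req.decode_tokens -= 1        # generate one token
            if req.decode_tokens <= 0:
                finished.append(req.rid)
                self.running.remove(req)
        return finished


if __name__ == "__main__":
    sched = WorkConservingScheduler(max_batch_tokens=64)
    for i in range(8):
        sched.submit(Request(rid=i, prompt_tokens=16, decode_tokens=4))
    done, t = [], 0
    while len(done) < 8:
        done += sched.step()
        t += 1
    print(f"all requests finished after {t} engine steps")
```

The key property is in the admission loop: the engine only idles capacity when the queue is empty or the next request cannot fit, which is the behavior the paper identifies as sufficient for maximum throughput on a single engine.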
Similar Papers
Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
Machine Learning (CS)
Makes AI answer questions much faster.
High-Throughput LLM inference on Heterogeneous Clusters
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster on different computers.
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
Machine Learning (CS)
Makes AI answer questions faster with less memory.