LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
By: Jaehong Cho, Hyunmin Choi, Jongse Park
Potential Business Impact:
Lets companies test new chips and software for making AI answer questions faster, without having to build the real systems first.
This paper introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5x fewer lines of code (LoC) than the predecessor's hardware-simulator integration and outperforms it, demonstrating LLMServingSim2.0's low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
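To make the trace-driven modeling idea concrete, here is a minimal Python sketch: profile each operator once on the target accelerator to build a latency table, then estimate a request's latency by replaying its operator trace against that table. All names here (`OpRecord`, `TraceDrivenModel`, the operators and latency values) are hypothetical illustrations under this assumption, not LLMServingSim2.0's actual API.

```python
from collections import namedtuple

# One profiled measurement: operator name, tensor shape, and the latency
# observed on a given accelerator (hypothetical schema for illustration).
OpRecord = namedtuple("OpRecord", ["op", "shape", "latency_us"])

class TraceDrivenModel:
    """Estimates end-to-end latency by summing profiled per-operator latencies."""

    def __init__(self):
        # (op, shape) -> measured latency in microseconds
        self.table = {}

    def load_profile(self, records):
        """Ingest operator-level measurements from a one-shot profiling run."""
        for r in records:
            self.table[(r.op, r.shape)] = r.latency_us

    def estimate(self, trace):
        """Sum the profiled latency of each operator in a request's execution trace."""
        return sum(self.table[(op, shape)] for op, shape in trace)

# Example: profile once on the target accelerator, then replay traces in simulation.
profile = [
    OpRecord("matmul", (4096, 4096), 120.0),
    OpRecord("softmax", (4096,), 8.5),
]
model = TraceDrivenModel()
model.load_profile(profile)
decode_step = [("matmul", (4096, 4096)), ("softmax", (4096,))]
print(model.estimate(decode_step))  # -> 128.5 (microseconds)
```

The appeal of this design is that adding a new accelerator only requires producing a new profile, not wiring a cycle-level hardware simulator into the system model.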
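The flexible policy interfaces could plausibly take the following shape: routing, cache management, and scheduling each defined as a small pluggable class, so users swap in their own logic without touching the rest of the simulator. The class and method names below are assumptions for illustration, not the simulator's real interface.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict

class Scheduler(ABC):
    """Pluggable scheduling policy: choose which queued requests run next."""
    @abstractmethod
    def select(self, queue, batch_budget):
        ...

class FCFSScheduler(Scheduler):
    """First-come-first-served: admit requests in arrival order up to the budget."""
    def select(self, queue, batch_budget):
        return queue[:batch_budget]

class RoundRobinRouter:
    """Pluggable router: spread incoming requests across serving instances."""
    def __init__(self, num_instances):
        self.num_instances = num_instances
        self._next = 0

    def route(self, request):
        target = self._next
        self._next = (self._next + 1) % self.num_instances
        return target

class LRUCacheManager:
    """Pluggable KV-cache policy: evict the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def touch(self, request_id):
        # Refresh recency on access; evict the oldest entry when over capacity.
        self._entries.pop(request_id, None)
        self._entries[request_id] = True
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
```

Under this sketch, evaluating a new policy means subclassing `Scheduler` (or replacing the router or cache class) and handing the instance to the simulator, leaving the serving pipeline itself unchanged.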
Similar Papers
Simulating LLM training workloads for heterogeneous compute and network infrastructure
Distributed, Parallel, and Cluster Computing
Makes AI training faster on mixed computer parts.
TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster and cheaper.
From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs
Hardware Architecture
Makes AI answer questions faster on special chips.