WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving
By: Chiheng Lou, Sheng Qi, Rui Kang, and more
Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT) latency. We identify the root cause of this compromise as their unawareness of future workload characteristics. In contrast, recent analyses of real-world traces have shown that LLM serving workloads are highly periodic and predictable over the long term. We propose universal GPU workers to enable one-for-many GPU prewarming, which loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance via proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation on real-world datasets shows that WarmServe improves TTFT by up to 50.8$\times$ over the state-of-the-art autoscaling-based system while serving up to 2.5$\times$ more requests than the GPU-sharing system.
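To make the one-for-many idea concrete, below is a minimal sketch of how a pool of model-agnostic workers could be warmed ahead of a predicted request rate and later bound to whichever model spikes. All names here (`UniversalWorker`, `proactive_prewarm`, the per-worker capacity) are illustrative assumptions, not WarmServe's actual API; the sketch only conveys the general mechanism the abstract describes.

```python
import math

class UniversalWorker:
    """A model-agnostic GPU worker that can be warmed before any
    model is assigned to it (hypothetical name, not WarmServe's API)."""

    def __init__(self, gpu_id: int):
        self.gpu_id = gpu_id
        self.model = None   # bound lazily, after prewarming
        self.warm = False

    def prewarm(self) -> None:
        # In a real system this would initialize the CUDA context,
        # pin host buffers, and reserve a device memory pool; here we
        # only mark the worker warm.
        self.warm = True

    def bind(self, model_name: str) -> None:
        # Binding a model to an already-warm worker skips cold-start
        # setup, so TTFT pays only for loading the weights themselves.
        assert self.warm, "worker must be prewarmed before binding"
        self.model = model_name

def proactive_prewarm(workers, predicted_rps: float,
                      capacity_rps: float = 10.0) -> None:
    """Warm just enough idle workers to cover the predicted load."""
    target = math.ceil(predicted_rps / capacity_rps)
    needed = max(0, target - sum(w.warm for w in workers))
    for w in workers:
        if needed == 0:
            break
        if not w.warm:
            w.prewarm()
            needed -= 1

workers = [UniversalWorker(i) for i in range(4)]
# Periodic traces make near-term load predictable; a constant stands
# in for the workload predictor here.
proactive_prewarm(workers, predicted_rps=25.0)
workers[0].bind("llama-3-8b")   # any model can claim a warm worker
print(sum(w.warm for w in workers), "workers prewarmed")
```

Because the workers are warmed before any model is chosen, the same warm pool can absorb a load spike for any of the co-located models, which is the "one-for-many" property the paper's name refers to.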
Similar Papers
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
Machine Learning (CS)
Makes AI models answer questions much faster.
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
Distributed, Parallel, and Cluster Computing
Makes AI models run cheaper and faster.
Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
Distributed, Parallel, and Cluster Computing
Makes AI models load much faster for users.