Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms
By: Ao Xu, Han Zhao, Weihao Cui, and more
Potential Business Impact:
Gets more work out of the same GPUs by running finetuning alongside AI model serving.
Large language models (LLMs) are increasingly deployed under the Model-as-a-Service (MaaS) paradigm. To meet stringent quality-of-service (QoS) requirements, existing LLM serving systems disaggregate the prefill and decode phases of inference. However, decode instances often experience low GPU utilization due to their memory-bound nature and insufficient batching under dynamic workloads, leaving compute resources underutilized. We introduce Harli, a serving system that improves GPU utilization by co-locating parameter-efficient finetuning (PEFT) tasks with LLM decode instances. PEFT tasks are compute-bound and memory-efficient, making them ideal candidates for safe co-location. Specifically, Harli addresses two key challenges, limited free memory and unpredictable interference, with three components: a unified memory allocator for runtime memory reuse, a two-stage latency predictor that models decode latency under co-location, and a QoS-guaranteed scheduler that maximizes finetuning throughput. Experimental results show that Harli improves finetuning throughput by 46.2% on average (up to 92.0%) over state-of-the-art serving systems, while maintaining strict QoS guarantees for inference decode.
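The scheduling idea in the abstract can be illustrated compactly: before each decode iteration, predict the decode latency under a candidate amount of co-located finetuning work, and admit only as much as keeps latency within the QoS budget. The sketch below is a minimal illustration under assumed names and a toy linear latency model; `DecodeBatch`, `predict_decode_latency`, and `QOS_TARGET_MS` are hypothetical placeholders, not Harli's actual API or predictor.

```python
# Illustrative sketch only: names and the linear interference model are
# assumptions for exposition, not Harli's implementation.
from dataclasses import dataclass

QOS_TARGET_MS = 50.0  # hypothetical per-iteration decode latency budget


@dataclass
class DecodeBatch:
    num_requests: int
    total_kv_tokens: int


def predict_decode_latency(batch: DecodeBatch, finetune_microbatches: int) -> float:
    """Toy stand-in for a latency predictor: a base term from the decode
    batch shape plus an interference term that grows with co-located
    finetuning work."""
    base_ms = 5.0 + 0.002 * batch.total_kv_tokens + 0.1 * batch.num_requests
    interference_ms = 3.0 * finetune_microbatches
    return base_ms + interference_ms


def max_colocated_finetune(batch: DecodeBatch, max_microbatches: int = 8) -> int:
    """Admit the largest amount of finetuning work whose predicted decode
    latency still stays within the QoS budget."""
    admitted = 0
    for mb in range(1, max_microbatches + 1):
        if predict_decode_latency(batch, mb) <= QOS_TARGET_MS:
            admitted = mb
        else:
            break
    return admitted


if __name__ == "__main__":
    batch = DecodeBatch(num_requests=16, total_kv_tokens=12_000)
    print(max_colocated_finetune(batch))
```

In this toy setup, a heavier decode batch leaves less headroom, so fewer finetuning micro-batches are admitted; the real system would replace the hand-written formula with its learned two-stage predictor.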
Similar Papers
Harli: Harvest Underutilized Resources in LLM Serving with Finetuning Tasks
Distributed, Parallel, and Cluster Computing
Lets AI do more work on one computer.
SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference
Distributed, Parallel, and Cluster Computing
Makes many AI models run faster together.
Predictable LLM Serving on GPU Clusters
Distributed, Parallel, and Cluster Computing
Makes computer programs run faster on shared machines.