Harli: Harvest Underutilized Resources in LLM Serving with Finetuning Tasks
By: Ao Xu, Han Zhao, Weihao Cui, and more
Potential Business Impact:
Lets one computer serve AI answers and train AI models at the same time.
Large language models (LLMs) are increasingly deployed under the Model-as-a-Service (MaaS) paradigm. To meet stringent quality-of-service (QoS) requirements, existing LLM serving systems disaggregate the prefill and decode phases of inference. However, decode instances often suffer low GPU utilization: they are memory-bound, and dynamic workloads frequently leave batches too small to saturate compute. We introduce Harli, a serving system that improves GPU utilization by co-locating parameter-efficient finetuning (PEFT) tasks with LLM decode instances. PEFT tasks are compute-bound and memory-efficient, making them ideal candidates for safe co-location. Harli addresses the two key challenges of co-location, limited spare memory and unpredictable interference, with three components: a unified memory allocator that reuses decode-instance memory at runtime, a two-stage predictor that models decode latency under interference, and a scheduler that maximizes finetuning throughput subject to QoS guarantees. Experimental results show that Harli improves finetuning throughput by 46.2% on average (up to 92.0%) over state-of-the-art serving systems, while maintaining strict QoS guarantees for inference decode.
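To make the scheduling idea concrete, here is a minimal sketch, not Harli's actual implementation, of how a latency predictor can gate co-located finetuning work: the scheduler admits only as many PEFT microbatches as the predicted decode latency allows under the QoS target. All names, signatures, and the linear latency model below are hypothetical illustrations.

```python
# Hypothetical sketch of QoS-gated co-location: a decode-latency predictor
# bounds how much finetuning work may share the GPU with a decode batch.
from dataclasses import dataclass


@dataclass
class DecodeBatch:
    num_requests: int      # requests batched in this decode step
    kv_cache_tokens: int   # total KV-cache tokens held by the batch


def predict_decode_latency(batch: DecodeBatch, finetune_microbatches: int) -> float:
    """Toy stand-in for a two-stage predictor: a base decode-latency estimate
    plus an interference term that grows with co-located finetune work.
    Coefficients are made up for illustration."""
    base_ms = 0.02 * batch.num_requests + 0.001 * batch.kv_cache_tokens / 1000
    interference_ms = 0.5 * finetune_microbatches
    return base_ms + interference_ms


def max_colocated_microbatches(batch: DecodeBatch, slo_ms: float) -> int:
    """Largest number of finetune microbatches whose predicted interference
    still keeps decode latency under the QoS target; 0 if none fits."""
    n = 0
    while predict_decode_latency(batch, n + 1) <= slo_ms:
        n += 1
    return n


if __name__ == "__main__":
    batch = DecodeBatch(num_requests=16, kv_cache_tokens=200_000)
    # With the toy coefficients above, a 5 ms SLO admits several microbatches.
    print(max_colocated_microbatches(batch, slo_ms=5.0))
```

The design point this illustrates is that co-location is admitted per decode step, so bursts of inference load automatically squeeze out finetuning work before the SLO is violated.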
Similar Papers
Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms
Distributed, Parallel, and Cluster Computing
Makes AI models work faster and use less power.
Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine-grained and Dynamic Parallelism
Distributed, Parallel, and Cluster Computing
Makes AI models run faster on different computers.
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
Distributed, Parallel, and Cluster Computing
Trains AI faster on different kinds of computers.