Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
By: Hang Zhang, Jiuchen Shi, Yixiao Wang, and more
Potential Business Impact:
Makes AI answer questions much faster.
Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For Multi-LoRA serving, caching hot KV caches and LoRA adapters in the high-bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving metrics such as Time-To-First-Token (TTFT) because they neglect the usage dependencies between LoRAs and KV caches when caching them. We therefore propose FASTLIBRA, a Multi-LoRA caching system that optimizes serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference with a unified caching pool. The cache swapper decides which LoRAs and KV caches to swap in or out based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that FASTLIBRA reduces TTFT by 63.4% on average compared to state-of-the-art works.
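To make the idea of a unified, dependency-aware cache more concrete, below is a minimal Python sketch of how a single HBM pool might hold both LoRA adapters and KV caches, track which KV caches depend on which LoRA, and pick swap-out victims with a simple cost model. This is not the paper's implementation; the class and function names (UnifiedCachePool, swap_cost, admit) and the cost formula are illustrative assumptions.

```python
# Hypothetical sketch of a unified LoRA/KV cache pool with dependency-aware
# eviction driven by a simple cost model. Names and formulas are assumptions,
# not the FASTLIBRA implementation.
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    key: str                  # LoRA adapter id or KV-cache id
    kind: str                 # "lora" or "kv"
    size_bytes: int
    reload_cost: float        # estimated cost to reload it after eviction
    reuse_prob: float         # estimated probability of near-term reuse
    depends_on: set = field(default_factory=set)  # e.g. a KV cache -> its LoRA


class UnifiedCachePool:
    """One HBM pool for LoRAs and KV caches, so eviction can respect the
    dependency that a cached KV is useless without its LoRA adapter."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries: dict[str, CacheEntry] = {}

    def swap_cost(self, e: CacheEntry) -> float:
        # Expected penalty of evicting this entry now. Evicting a LoRA also
        # invalidates its dependent KV caches, so their penalty is added.
        dependents = [d for d in self.entries.values() if e.key in d.depends_on]
        return e.reuse_prob * e.reload_cost + sum(
            d.reuse_prob * d.reload_cost for d in dependents
        )

    def admit(self, entry: CacheEntry) -> None:
        # Swap out the lowest-penalty entries until the new entry fits.
        while self.used + entry.size_bytes > self.capacity and self.entries:
            victim = min(self.entries.values(), key=self.swap_cost)
            self._evict(victim)
        self.entries[entry.key] = entry
        self.used += entry.size_bytes

    def _evict(self, victim: CacheEntry) -> None:
        # Dependency-aware eviction: dropping a LoRA also drops the KV caches
        # that were produced under it.
        for d in [d for d in self.entries.values() if victim.key in d.depends_on]:
            self._evict(d)
        if victim.key in self.entries:
            del self.entries[victim.key]
            self.used -= victim.size_bytes
```

In this sketch, admit() plays the role of the swap-in path when HBM is busy, and the per-entry swap_cost() stands in for the unified cost model that scores LoRAs and KV caches on the same scale.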
Similar Papers
Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs
Distributed, Parallel, and Cluster Computing
Makes AI models answer questions much faster.
EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices
Distributed, Parallel, and Cluster Computing
Makes smart computer helpers work faster on phones.
Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
Distributed, Parallel, and Cluster Computing
Makes AI models run faster using fewer computers.