PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
By: Bo Jiang, Taolue Yang, Youyuan Liu, and more
Potential Business Impact:
Makes AI understand long texts using less memory.
Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge due to the substantial memory requirements of the key-value (KV) cache, which can scale to several gigabytes as sequence length and batch size increase. In this paper, we present PackKV, a generic and efficient KV cache management framework optimized for long-context generation. PackKV introduces novel lossy compression techniques specifically tailored to the characteristics of KV cache data, featuring a careful co-design of compression algorithms and system architecture. Our approach is compatible with the dynamically growing nature of the KV cache while preserving high computational efficiency. Experimental results show that, at the same minimal accuracy drop as state-of-the-art quantization methods, PackKV achieves, on average, a 153.2% higher memory reduction rate for the K cache and 179.6% for the V cache. Furthermore, PackKV delivers extremely high execution throughput, effectively eliminating decompression overhead and accelerating the matrix-vector multiplication operation. Specifically, PackKV achieves an average throughput improvement of 75.7% for K and 171.7% for V across A100 and RTX Pro 6000 GPUs, compared to cuBLAS matrix-vector multiplication kernels, while demanding less GPU memory bandwidth. Code is available at https://github.com/BoJiang03/PackKV
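To make the general idea concrete, the sketch below shows a minimal, illustrative example of lossy KV-cache compression in which decompression is folded into the attention matrix-vector product, so the full-precision K block is never rebuilt in memory. This is not the PackKV algorithm described in the paper; the per-channel quantization scheme, the 4-bit setting, and all function names and shapes are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: a generic lossy KV-cache compression idea
# (per-channel quantization of the K cache, with dequantization fused
# into the attention score computation). NOT the PackKV method; all
# names, shapes, and the 4-bit setting are assumptions.
import numpy as np

def compress_k_cache(K: np.ndarray, bits: int = 4):
    """Quantize a K cache block of shape (seq_len, head_dim) per channel."""
    lo = K.min(axis=0, keepdims=True)           # per-channel minimum
    hi = K.max(axis=0, keepdims=True)           # per-channel maximum
    scale = (hi - lo) / (2**bits - 1) + 1e-8    # avoid division by zero
    codes = np.round((K - lo) / scale).astype(np.uint8)
    return codes, scale, lo                     # compact codes + metadata

def fused_scores(q: np.ndarray, codes: np.ndarray,
                 scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Compute attention scores q @ K^T while 'decompressing' on the fly.

    The dequantized K would be codes * scale + lo, so
    q @ K^T = (q * scale) @ codes^T + q @ lo, and the full-precision
    K block never has to be materialized.
    """
    return (q * scale.squeeze(0)) @ codes.T.astype(np.float32) + q @ lo.squeeze(0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((1024, 128)).astype(np.float32)  # seq_len x head_dim
    q = rng.standard_normal(128).astype(np.float32)

    codes, scale, lo = compress_k_cache(K)
    exact = q @ K.T                      # reference scores
    approx = fused_scores(q, codes, scale, lo)
    print("max abs error:", np.abs(exact - approx).max())
```

In an actual system the compressed codes would live on the GPU and the fused dequantize-and-multiply step would be a custom kernel (the paper compares against cuBLAS matrix-vector kernels); the numpy version above only demonstrates why fusing decompression with the matrix-vector product avoids both the decompression pass and the extra memory traffic.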
Similar Papers
KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
Distributed, Parallel, and Cluster Computing
Makes AI understand long stories with less memory.
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Computation and Language
Makes AI understand long stories better, faster.
KV Cache Compression for Inference Efficiency in LLMs: A Review
Distributed, Parallel, and Cluster Computing
Makes AI smarter and faster using less memory.