GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
By: Maanas Taneja, Purab Shingvi
Potential Business Impact:
Shrinks the memory a language model needs while generating text, making inference faster and cheaper to run.
The key-value (KV) cache in large language models presents a significant memory bottleneck during inference, growing linearly with sequence length and often exceeding the memory footprint of the model weights themselves. We implement and evaluate GPU-accelerated INT8 quantization for KV cache compression, achieving 4$\times$ memory reduction with minimal accuracy degradation. We develop four CUDA kernel variants -- naive, tiled, coarsened, and vectorized -- and benchmark them across realistic workload sizes of up to 1 billion elements. Our vectorized kernel achieves up to 1,694$\times$ speedup over CPU baselines while maintaining reconstruction error below 0.004 and attention score error below 0.1, even for 8K-dimensional heads. These results demonstrate that INT8 quantization provides a practical approach for reducing memory pressure in LLM inference with negligible computational overhead (6--58 ms) and minimal impact on downstream model behavior.
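The abstract does not reproduce the kernels themselves; as an illustration of the general approach, the sketch below shows symmetric per-tensor INT8 quantization and dequantization using 128-bit vectorized (float4/char4) loads and stores. The kernel names, the 127/max|x| scale convention, and the launch parameters are assumptions for illustration, not the authors' exact implementation.

```cuda
// Hypothetical sketch: vectorized INT8 quantization of a KV-cache tensor,
// assuming a precomputed symmetric per-tensor scale (scale = max_abs / 127).
#include <cuda_runtime.h>
#include <cstdint>

__global__ void quantize_int8_vec4(const float4* __restrict__ in,
                                   char4* __restrict__ out,
                                   float inv_scale,   // 127.0f / max_abs (assumed convention)
                                   size_t n_vec4)     // number of float4 groups
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n_vec4) return;

    float4 v = in[i];  // one 128-bit load per thread (vectorized access)
    char4 q;
    q.x = (int8_t)rintf(fminf(127.f, fmaxf(-127.f, v.x * inv_scale)));
    q.y = (int8_t)rintf(fminf(127.f, fmaxf(-127.f, v.y * inv_scale)));
    q.z = (int8_t)rintf(fminf(127.f, fmaxf(-127.f, v.z * inv_scale)));
    q.w = (int8_t)rintf(fminf(127.f, fmaxf(-127.f, v.w * inv_scale)));
    out[i] = q;        // one 32-bit store per thread
}

// Dequantization reverses the mapping: x_hat = q * scale.
__global__ void dequantize_int8_vec4(const char4* __restrict__ in,
                                     float4* __restrict__ out,
                                     float scale,
                                     size_t n_vec4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n_vec4) return;

    char4 q = in[i];
    out[i] = make_float4(q.x * scale, q.y * scale, q.z * scale, q.w * scale);
}

// Illustrative launch, assuming the element count n is a multiple of 4:
//   size_t n_vec4 = n / 4;
//   int threads = 256;
//   int blocks  = (int)((n_vec4 + threads - 1) / threads);
//   quantize_int8_vec4<<<blocks, threads>>>(d_in, d_q, 127.0f / max_abs, n_vec4);
```

Packing four values per load/store is one plausible way to reach the memory-bandwidth-bound regime the vectorized variant targets; the tiled and coarsened variants mentioned in the abstract would differ mainly in how work is mapped to threads, not in the quantization arithmetic.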
Similar Papers
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
Machine Learning (CS)
Makes AI remember more without using much memory.
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
Machine Learning (CS)
Makes AI understand long texts faster.
Quantize What Counts: More for Keys, Less for Values
Machine Learning (CS)
Makes AI smarter using less computer memory.