Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
By: Haojun Xia, Xiaoxia Wu, Jisen Li, and more
Potential Business Impact:
Makes AI models use much less memory.
The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit quantization often degrades it, especially on long-context reasoning. We close this gap with Kitty, an algorithm-system co-design for mixed-precision KV caching. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost -- which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision -- maintains a near-zero accuracy drop while approaching 2-bit memory cost. The main systems challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses this by decomposing each mixed-precision Key page into two tensors with a unified 2-bit representation. On top of this, Kitty provides a page-centric KV layout, Triton-compatible page-dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3 and LLaMA3), Kitty cuts KV-cache memory by nearly 8x with negligible accuracy loss, enabling up to 8x larger batches and 2.1x-4.1x higher throughput under the same memory budget. We release the full implementation of Kitty at https://github.com/Summer-Summer/Kitty.
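To make the two ideas concrete, below is a minimal PyTorch sketch. Everything in it is an illustrative assumption rather than the paper's method: the magnitude-range sensitivity proxy stands in for the unspecified sensitivity metric, the function names are hypothetical, and the bit split v = 4*high + low is one plausible reading of "decompose each mixed-precision Key page into two tensors with unified 2-bit precision".

    # Hedged sketch of Dynamic Channel-wise Precision Boost plus the
    # two-tensor decomposition. Names and the sensitivity proxy are
    # assumptions for illustration; the abstract does not specify them.
    import torch

    def quantize_key_page(K: torch.Tensor, boost_frac: float = 0.03):
        """Quantize a Key page K [tokens, channels]: most channels at 2-bit,
        the top boost_frac most sensitive channels boosted to 4-bit."""
        # Assumed sensitivity proxy: channels with a larger dynamic range
        # lose more information under coarse 2-bit quantization.
        sensitivity = K.amax(dim=0) - K.amin(dim=0)        # [channels]
        n_boost = max(1, int(boost_frac * K.shape[1]))
        boosted = torch.topk(sensitivity, n_boost).indices

        bits = torch.full((K.shape[1],), 2, dtype=torch.int64)
        bits[boosted] = 4

        q = torch.empty_like(K, dtype=torch.int64)
        scale = torch.empty(K.shape[1])
        zero = torch.empty(K.shape[1])
        for c in range(K.shape[1]):                        # per-channel asymmetric quant
            levels = 2 ** int(bits[c]) - 1                 # 3 for 2-bit, 15 for 4-bit
            lo, hi = K[:, c].min(), K[:, c].max()
            scale[c] = (hi - lo) / levels if hi > lo else 1.0
            zero[c] = lo
            q[:, c] = torch.round((K[:, c] - lo) / scale[c]).clamp_(0, levels).long()
        return q, bits, scale, zero

    def decompose(q: torch.Tensor, bits: torch.Tensor):
        """Split mixed 2/4-bit codes into two uniformly 2-bit tensors:
        a 4-bit code v factors as v = 4*high + low with high, low in [0, 3]."""
        low = q & 3                                        # low 2 bits, all channels
        high = torch.where(bits == 4, q >> 2, torch.zeros_like(q))
        return low, high

    def dequantize(low, high, scale, zero):
        # One uniform path: high is implicitly zero for 2-bit channels,
        # so the same formula serves both precisions.
        return zero + scale * (low + 4 * high).float()

Under this reading, every stored tensor is plain 2-bit, so a kernel can dequantize boosted and non-boosted channels with the same coalesced loads and the same arithmetic, which is consistent with the abstract's claim of uniform dequantization without scattered reads or hard-coded masks.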
Similar Papers
MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
Machine Learning (CS)
Makes AI think better using less computer memory.
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
Machine Learning (CS)
Makes AI understand long texts faster.
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
CV and Pattern Recognition
Makes AI watch videos using less computer memory.