LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference
By: Harsh Vardhan Bansal
Potential Business Impact:
Makes AI answer questions much faster.
Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1× speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.
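The abstract describes caching intermediate activations keyed by a lightweight fingerprint of the input and reusing them for semantically similar requests, with eviction to control staleness. As a rough illustration of that idea only (not the paper's actual implementation), the sketch below assumes a sign-hash fingerprint over mean-pooled input embeddings, a per-layer cache with plain LRU eviction, and a placeholder `run_layer` standing in for a real transformer layer; all names and parameters here are hypothetical.

```python
import hashlib
from collections import OrderedDict

import numpy as np


def fingerprint(embeddings: np.ndarray, num_bits: int = 64) -> str:
    """Hypothetical lightweight fingerprint: sign-quantize a mean-pooled
    input embedding so that similar inputs tend to produce the same key."""
    pooled = embeddings.mean(axis=0)
    bits = (pooled[:num_bits] > 0).astype(np.uint8)
    return hashlib.sha1(bits.tobytes()).hexdigest()


class LayerWiseCache:
    """Toy layer-wise activation cache with LRU eviction.

    Keys are (fingerprint, layer_index); values are the cached
    intermediate activations produced at that layer.
    """

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, fp: str, layer: int):
        key = (fp, layer)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, fp: str, layer: int, activations: np.ndarray):
        key = (fp, layer)
        self._store[key] = activations
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used entry


def run_layer(layer_idx: int, hidden: np.ndarray) -> np.ndarray:
    """Stand-in for a transformer layer's forward pass (not a real layer)."""
    rng = np.random.default_rng(layer_idx)
    w = rng.standard_normal((hidden.shape[-1], hidden.shape[-1]))
    return np.tanh(hidden @ (w / np.sqrt(hidden.shape[-1])))


def forward_with_cache(embeddings: np.ndarray, cache: LayerWiseCache,
                       num_layers: int = 4) -> np.ndarray:
    """Run layers, reusing cached activations when the fingerprint matches."""
    fp = fingerprint(embeddings)
    hidden = embeddings
    for layer in range(num_layers):
        cached = cache.get(fp, layer)
        if cached is not None:
            hidden = cached                # reuse stored activation, skip compute
            continue
        hidden = run_layer(layer, hidden)
        cache.put(fp, layer, hidden)       # store for future similar inputs
    return hidden


if __name__ == "__main__":
    cache = LayerWiseCache(capacity=16)
    x = np.random.randn(8, 64).astype(np.float32)
    out_cold = forward_with_cache(x, cache)  # cold pass: computes and stores every layer
    out_warm = forward_with_cache(x, cache)  # warm pass: served from the layer cache
    print(np.array_equal(out_cold, out_warm))  # True: cached activations were reused
```

In a real system the fingerprint would be tuned so that only inputs whose activations are genuinely interchangeable collide, and the eviction policy would weigh staleness and hit rate rather than pure recency, as the abstract's "adaptive eviction strategies" suggest.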
Similar Papers
VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference
CV and Pattern Recognition
Saves computer power by remembering past work.
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Computation and Language
Makes AI understand text much faster.