HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse
By: Yuwei An, Yihua Cheng, Seo Jin Park, and more
Potential Business Impact:
Makes AI smarter and faster by reusing old thoughts.
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While rerankers refine the selection of retrieved documents in RAG pipelines, they introduce computational challenges that hinder high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing the document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2-3× throughput improvement with decoder-only rerankers while also delivering higher downstream performance than a traditional RAG service.
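To make the core idea concrete, here is a minimal sketch of document-side KV-cache reuse for a decoder-only reranker. This is not HyperRAG's actual implementation: the model choice (`gpt2` as a stand-in), the helper functions, and the log-probability scoring proxy are all illustrative assumptions. The key point it demonstrates is ordering the reranker input document-first, so the document prefix's KV-cache can be prefilled once and shared across every query that scores that document.

```python
# Conceptual sketch of document-side KV-cache reuse for a decoder-only
# reranker (assumptions: Hugging Face causal LM, document-first prompt
# ordering, toy log-probability scoring instead of a trained relevance head).
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative stand-in for a decoder-only reranker
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

doc_cache = {}  # doc_id -> KV-cache computed once per document

def prefill_document(doc_id: str, doc_text: str) -> None:
    """Prefill the reranker over the document alone and cache its KVs."""
    ids = tokenizer(doc_text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)
    doc_cache[doc_id] = out.past_key_values

def score(doc_id: str, query_text: str) -> float:
    """Score a (document, query) pair, reusing the cached document-side KVs."""
    # Copy the cached KVs so scoring many queries against the same document
    # does not mutate the shared cache entry.
    past = copy.deepcopy(doc_cache[doc_id])
    q_ids = tokenizer(" " + query_text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Only the query tokens are processed; the document prefix is free.
        out = model(q_ids, past_key_values=past, use_cache=True)
    # Toy relevance proxy: mean log-probability of the query tokens given the
    # document prefix. A real reranker would use a trained scoring head.
    logprobs = torch.log_softmax(out.logits[:, :-1, :], dim=-1)
    token_lp = logprobs.gather(-1, q_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

# Usage: prefill each retrieved document once, then rank cheaply per query.
prefill_document("doc0", "HyperRAG reuses document-side KV-cache for reranking.")
print(score("doc0", "How does KV-cache reuse speed up reranking?"))
```

Since prefill over long retrieved documents dominates reranker inference and the same documents recur across queries, caching the document prefix amortizes most of that computation, which is the intuition behind the throughput gains the paper reports.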
Similar Papers
Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
Computation and Language
Makes AI remember more information faster.
DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation
Computation and Language
Helps AI pick the best facts for answers.
KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering
Computation and Language
Helps AI answer questions more accurately using more facts.