V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

Published: December 13, 2025 | arXiv ID: 2512.12284v1

By: Donghyuk Kim, Sejeong Yang, Wonjin Shin and more

Potential Business Impact:

Makes videos work fast on small devices.

Business Areas:

Video Streaming Content and Publishing, Media and Entertainment, Video

Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. Processing this input requires an iterative prefill stage, which is unique to streaming video LLMs. The iterative prefill stage imposes significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, these issues are exacerbated in edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time performance of 3.9-8.3 FPS and energy-efficient streaming video LLM inference on edge deployment with negligible accuracy loss. Although the DRE accounts for only 2.2% of power and 2.0% of area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over an AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.
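The abstract does not say how ReSV forms its clusters or selects them at retrieval time, so the following is only a minimal sketch of the general idea of similarity-based clustering followed by retrieval, not the authors' algorithm. The function names (`cluster_keys`, `retrieve`), the greedy assignment rule, the cosine threshold of 0.9, and the toy 2-D key vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_keys(keys, threshold=0.9):
    """Greedy clustering (hypothetical rule, not taken from the paper).

    Each key joins the first cluster whose centroid has cosine similarity
    >= threshold; otherwise it starts a new cluster. The centroid is kept
    as the running mean of its members.
    """
    clusters = []
    for idx, key in enumerate(keys):
        for cluster in clusters:
            if cosine(cluster["centroid"], key) >= threshold:
                cluster["members"].append(idx)
                n = len(cluster["members"])
                cluster["centroid"] = [
                    (c * (n - 1) + k) / n
                    for c, k in zip(cluster["centroid"], key)
                ]
                break
        else:
            clusters.append({"centroid": list(key), "members": [idx]})
    return clusters


def retrieve(query, clusters, top_k=1):
    """Return sorted token indices from the top_k clusters most similar to the query."""
    ranked = sorted(
        clusters,
        key=lambda c: cosine(query, c["centroid"]),
        reverse=True,
    )
    selected = []
    for cluster in ranked[:top_k]:
        selected.extend(cluster["members"])
    return sorted(selected)


# Toy cached keys: tokens 0-1 are near-duplicates, as are tokens 2-3.
keys = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
clusters = cluster_keys(keys, threshold=0.9)
selected = retrieve([0.0, 1.0], clusters, top_k=1)
print(len(clusters), selected)
```

In this toy setup, only the centroids need to be compared against the query, and only the members of the selected clusters would be fetched for attention. That is the intuition behind retrieving a subset of the KV cache instead of attending over all of it; a real system would operate per layer and per head on much higher-dimensional keys.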

Streaming Video Question-Answering with In-context Video KV-Cache Retrieval

CV and Pattern Recognition

Answers questions about long videos instantly.

1 Mar 2025 0

87%

InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

Image and Video Processing

Lets phones watch long videos without running out of memory.

18 Jun 2025 2

87%

Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

CV and Pattern Recognition

Lets computers understand long videos quickly.

10 Apr 2025 1

Country of Origin

🇰🇷 Korea, Republic of

Page Count

14 pages
