Score: 1

StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Published: October 21, 2025 | arXiv ID: 2510.18269v1

By: Xueyi Chen, Keda Tao, Kele Shao and more

Potential Business Impact:

Makes computers understand videos faster and cheaper.

Business Areas:

Video Streaming Content and Publishing, Media and Entertainment, Video

Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation causes the token count to grow without bound, creating efficiency bottlenecks. Existing approaches, however, regulate only the post-LLM KV cache and leave the costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both the pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of each frame's visual tokens instead of all of them. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active KV cache bounded regardless of stream length. Experiments show that our method achieves $15.7\times$ KV-cache compression, $1.2\times$ lower peak memory, and $2\times$ faster time to first token (TTFT) compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods, with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
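The two ideas in the abstract, a fixed per-frame token budget and 4-bit token storage, can be illustrated with a small sketch. This is not the authors' implementation: the scoring rule (a weighted mix of Euclidean change against the previous frame and a given saliency value), the `alpha` weight, the min-max quantizer, and all function names are assumptions chosen only to make the concepts concrete.

```python
import math


def select_tokens(prev_frame, cur_frame, saliency, budget, alpha=0.5):
    """Return the indices of at most `budget` tokens to keep from cur_frame.

    Hypothetical scoring: alpha * (change vs. the same token position in the
    previous frame) + (1 - alpha) * saliency. Only past frames are consulted,
    so the selection is causal.
    """
    scores = []
    for i, tok in enumerate(cur_frame):
        if prev_frame is None:
            change = 1.0  # first frame: treat every token as new
        else:
            change = math.dist(tok, prev_frame[i])
        scores.append(alpha * change + (1.0 - alpha) * saliency[i])
    ranked = sorted(range(len(cur_frame)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # keep original token order


def quantize_4bit(vec):
    """Min-max quantize a vector to integer codes in 0..15 (assumed scheme)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant vectors
    codes = [min(15, max(0, round((x - lo) / scale))) for x in vec]
    return codes, lo, scale


def dequantize_4bit(codes, lo, scale):
    """Approximately reconstruct the original values from 4-bit codes."""
    return [lo + c * scale for c in codes]


# Toy data: 4 tokens per frame, 2 dimensions each.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
cur = [[0.0, 0.0], [1.0, 1.0], [5.0, 2.0], [3.0, 3.0]]
sal = [0.1, 0.9, 0.2, 0.1]

kept = select_tokens(prev, cur, sal, budget=2)
stored = [quantize_4bit(cur[i]) for i in kept]
restored = [dequantize_4bit(*entry) for entry in stored]
```

With a budget of 2, the sketch keeps the token that changed most and the most salient token, so per-frame cost is fixed regardless of how many tokens the frame contains. The reconstruction error of the quantizer is bounded by half a quantization step per value, which is the trade-off that 4-bit storage makes for a smaller memory footprint.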

Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

CV and Pattern Recognition

Makes videos play faster without losing quality.

30 Nov 2025 1

89%

TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

CV and Pattern Recognition

Makes AI videos follow instructions better.

9 Oct 2025 1

88%

Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior

CV and Pattern Recognition

Makes watching long videos faster for computers.

7 Dec 2025 2

Page Count

10 pages
