Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
By: Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, and more
Potential Business Impact:
Lets computers understand videos faster.
Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
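The core idea described in the abstract, comparing spatial patches across consecutive frames, dropping the ones that barely change, and keeping each surviving token's original position, could be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the function name `evs_prune`, the mean-absolute-difference criterion, and the `threshold` parameter are assumptions made for the example.

```python
import torch

def evs_prune(frames: torch.Tensor, threshold: float = 0.05):
    """Hypothetical sketch of pruning temporally static patch tokens.

    frames: (T, N, D) tensor of per-frame patch tokens
            (T frames, N patches per frame, D-dim embeddings).
    Returns the kept tokens and their (frame, patch) indices so that
    positional identity is preserved for the language model.
    """
    T, N, _ = frames.shape
    # Keep every patch of the first frame.
    keep = torch.ones(T, N, dtype=torch.bool)
    # For later frames, drop patches that barely changed vs. the previous frame.
    diffs = (frames[1:] - frames[:-1]).abs().mean(dim=-1)  # (T-1, N)
    keep[1:] = diffs > threshold
    # Gather surviving tokens together with their original (t, n) positions.
    t_idx, n_idx = keep.nonzero(as_tuple=True)
    tokens = frames[t_idx, n_idx]                      # (K, D) pruned sequence
    positions = torch.stack([t_idx, n_idx], dim=-1)    # (K, 2) positional identity
    return tokens, positions

# Example: 16 frames, 196 patches per frame, 1024-dim tokens.
tokens, positions = evs_prune(torch.randn(16, 196, 1024), threshold=0.05)
print(tokens.shape, positions.shape)
```

Because the retained tokens carry their original (frame, patch) indices, a static scene compresses to far fewer tokens while the positional information the LLM consumes stays intact; raising the threshold trades accuracy for a shorter prefill and a lower time-to-first-token.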
Similar Papers
EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models
CV and Pattern Recognition
Makes AI understand pictures faster by picking key parts.
Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
CV and Pattern Recognition
Lets computers watch long videos faster.
From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding
CV and Pattern Recognition
Helps computers understand long videos better.