
Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Published: August 5, 2025 | arXiv ID: 2508.03337v6

By: Shaoguang Wang, Ziyang Chen, Yijie Xu and more

Potential Business Impact:

Makes videos easier for computers to understand.

The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a "less is more" phenomenon in which excessive frames can paradoxically degrade performance due to context dilution. At the same time, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term "visual echoes". To address both challenges, we propose Adaptive Frame-Pruning (AFP), a post-processing method that prunes the selected keyframes. AFP applies an adaptive hierarchical clustering algorithm to a fused ResNet-50 and CLIP feature space to identify these echoes and merge each group into a single representative. To compensate for the resulting information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. In extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach reduces the number of required frames by up to 86.9% and total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only improves efficiency but often also improves accuracy over baselines that use more frames. The code will be released upon publication.
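
The pruning step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the fusion weights, the linkage criterion, or the rule that makes the clustering adaptive. The sketch assumes the per-frame feature vectors are already computed, fuses them by weighted concatenation, uses average linkage on cosine distance, and sets the merge threshold to a fraction of the mean pairwise distance. The names `fuse`, `prune_echoes`, `weight_a`, and `scale` are hypothetical.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(feat_a, feat_b, weight_a=0.5):
    """Assumed fusion: weighted concatenation of two unit-normalized vectors."""
    a = [weight_a * x for x in l2_normalize(feat_a)]
    b = [(1.0 - weight_a) * x for x in l2_normalize(feat_b)]
    return l2_normalize(a + b)


def cosine_distance(u, v):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(u, v))


def prune_echoes(fused, scale=0.5):
    """Return the indices of one representative frame per cluster.

    Average-linkage agglomerative clustering; merging stops once the closest
    pair of clusters is farther apart than `scale` times the mean pairwise
    distance (an assumed stand-in for the paper's adaptive criterion).
    """
    n = len(fused)
    if n < 2:
        return list(range(n))
    dist = {(i, j): cosine_distance(fused[i], fused[j])
            for i, j in combinations(range(n), 2)}
    threshold = scale * sum(dist.values()) / len(dist)

    def d(i, j):
        return dist[(min(i, j), max(i, j))]

    def linkage(c1, c2):
        return sum(d(i, j) for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[a], clusters[b]) > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected

    # Keep the medoid of each cluster: the frame closest to its cluster mates.
    reps = [min(c, key=lambda i: sum(d(i, j) for j in c if j != i))
            for c in clusters]
    return sorted(reps)


# Toy stand-ins for ResNet-50 and CLIP features: frames 0-2 are near-duplicates.
resnet_like = [[1, 0, 0], [0.99, 0.05, 0], [0.98, 0, 0.05], [0, 1, 0], [0, 0, 1]]
clip_like = [[1, 0], [1, 0], [1, 0], [0, 1], [0, -1]]
fused = [fuse(a, b) for a, b in zip(resnet_like, clip_like)]
print(prune_echoes(fused))  # → [0, 3, 4]
```

In this toy input, the three near-duplicate frames collapse to a single representative while the two distinct frames survive. Lowering `scale` makes the pruning more conservative.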


CV and Pattern Recognition

Makes AI understand videos using fewer pictures.

5 Aug 2025


Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

Machine Learning (CS)

Makes computers understand long videos faster.

13 Mar 2025


From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding

CV and Pattern Recognition

Helps computers understand long videos better.

2 Oct 2025

Country of Origin

🇭🇰 Hong Kong

Page Count

20 pages
