Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance
By: Yuxuan Liang, Xu Li, Xiaolei Chen, and more
Potential Business Impact:
Focuses on important image parts for faster AI.
Large Vision-Language Models (LVLMs) have significantly advanced multimodal understanding but still struggle to process high-resolution images efficiently. Recent approaches partition high-resolution images into multiple sub-images, dramatically increasing the number of visual tokens and, with it, the computational cost of inference, which grows quadratically with token count under self-attention. To address these limitations, we propose Pyramid Token Pruning (PTP), a training-free token pruning strategy that integrates bottom-up visual saliency at both the region and token levels with top-down instruction-guided importance. Inspired by human visual attention mechanisms, PTP retains more tokens from visually salient regions and further leverages textual instructions to pinpoint the tokens most relevant to a given multimodal task. Extensive experiments across 13 diverse benchmarks demonstrate that PTP substantially reduces computational overhead and inference latency with minimal performance loss.
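To make the pruning logic concrete, here is a minimal sketch of how a pyramid-style, training-free pruner could combine per-region saliency with instruction relevance. It is an illustration under stated assumptions, not the authors' implementation: the function name `ptp_prune`, the use of cosine similarity for instruction-guided scoring, and the proportional budget allocation are all assumptions; the paper only specifies that region-level and token-level saliency are fused with instruction-guided importance.

```python
# Hypothetical sketch of pyramid token pruning (not the paper's code).
# Assumes: visual token embeddings, a sub-image (region) id per token,
# a bottom-up saliency score per region, and a pooled instruction embedding.
import torch
import torch.nn.functional as F

def ptp_prune(visual_tokens, region_ids, region_saliency, text_embedding, keep_ratio=0.5):
    """
    visual_tokens:   (N, D) token embeddings from all sub-images
    region_ids:      (N,)   sub-image index of each token, in [0, R)
    region_saliency: (R,)   bottom-up saliency per region (e.g., attention mass)
    text_embedding:  (D,)   pooled instruction embedding
    Returns the indices of the tokens to keep.
    """
    # Top-down importance: similarity between each token and the instruction.
    token_scores = F.cosine_similarity(visual_tokens, text_embedding.unsqueeze(0), dim=-1)

    # Allocate the overall token budget across regions in proportion to
    # saliency, so visually salient sub-images retain more tokens.
    budget = int(keep_ratio * visual_tokens.size(0))
    weights = region_saliency / region_saliency.sum()

    keep = []
    for r, w in enumerate(weights):
        idx = (region_ids == r).nonzero(as_tuple=True)[0]
        k = min(len(idx), max(1, int(round(w.item() * budget))))
        # Within each region, keep the tokens most relevant to the instruction.
        top = torch.topk(token_scores[idx], k).indices
        keep.append(idx[top])
    return torch.cat(keep)
```

The design choice this sketch highlights is the two-stage split: bottom-up saliency decides how many tokens each region keeps, while top-down instruction relevance decides which tokens within a region survive, so no retraining or fine-tuning is required.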
Similar Papers
GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models
CV and Pattern Recognition
Makes AI understand pictures faster and cheaper.
Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models
CV and Pattern Recognition
Makes AI understand documents faster and cheaper.
Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning
CV and Pattern Recognition
Makes AI see details with less computer power.