Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance
By: Yuxuan Liang, Xu Li, Xiaolei Chen, and more
Potential Business Impact:
Focuses on important image parts for faster AI.
Large Vision-Language Models (LVLMs) have significantly advanced multimodal understanding but still struggle to process high-resolution images efficiently. Recent approaches partition high-resolution images into multiple sub-images, dramatically increasing the number of visual tokens and, with it, the computational cost of inference, which grows quadratically with token count under self-attention. To address these limitations, we propose Pyramid Token Pruning (PTP), a training-free token pruning strategy that integrates bottom-up visual saliency at both the region and token levels with top-down instruction-guided importance. Inspired by human visual attention mechanisms, PTP retains more tokens from visually salient regions and further leverages textual instructions to pinpoint the tokens most relevant to a given multimodal task. Extensive experiments across 13 diverse benchmarks demonstrate that PTP substantially reduces computational overhead and inference latency with minimal performance loss.
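To make the pruning logic concrete, here is a minimal sketch of how a pyramid-style, training-free pruner could combine per-region saliency with instruction relevance. It is an illustration under stated assumptions, not the authors' implementation: the function name `ptp_prune`, the use of cosine similarity for instruction-guided scoring, and the proportional budget allocation are all assumptions; the paper only specifies that region-level and token-level saliency are fused with instruction-guided importance.

```python
# Hypothetical sketch of pyramid token pruning (not the paper's code).
# Assumes: visual token embeddings, a sub-image (region) id per token,
# a bottom-up saliency score per region, and a pooled instruction embedding.
import torch
import torch.nn.functional as F

def ptp_prune(visual_tokens, region_ids, region_saliency, text_embedding, keep_ratio=0.5):
    """
    visual_tokens:   (N, D) token embeddings from all sub-images
    region_ids:      (N,)   sub-image index of each token, in [0, R)
    region_saliency: (R,)   bottom-up saliency per region (e.g., attention mass)
    text_embedding:  (D,)   pooled instruction embedding
    Returns the indices of the tokens to keep.
    """
    # Top-down importance: similarity between each token and the instruction.
    token_scores = F.cosine_similarity(visual_tokens, text_embedding.unsqueeze(0), dim=-1)

    # Allocate the overall token budget across regions in proportion to
    # saliency, so visually salient sub-images retain more tokens.
    budget = int(keep_ratio * visual_tokens.size(0))
    weights = region_saliency / region_saliency.sum()

    keep = []
    for r, w in enumerate(weights):
        idx = (region_ids == r).nonzero(as_tuple=True)[0]
        k = min(len(idx), max(1, int(round(w.item() * budget))))
        # Within each region, keep the tokens most relevant to the instruction.
        top = torch.topk(token_scores[idx], k).indices
        keep.append(idx[top])
    return torch.cat(keep)
```

The design choice this sketch highlights is the two-stage split: bottom-up saliency decides how many tokens each region keeps, while top-down instruction relevance decides which tokens within a region survive, so no retraining or fine-tuning is required.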
Similar Papers
GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models
CV and Pattern Recognition
Makes AI understand pictures faster and cheaper.
Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models
CV and Pattern Recognition
Makes AI understand documents faster and cheaper.
Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning
CV and Pattern Recognition
Makes AI see details with less computer power.