STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
By: Yichen Guo, Hanze Li, Zonghao Zhang, and more
Potential Business Impact:
Speeds up AI that understands pictures and words.
Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing either on visual self-attention or on visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved, performance.
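The abstract does not include code, so the following is a minimal PyTorch sketch of the two-stage idea it describes: score visual tokens first by the self-attention they receive from other visual tokens, then by the cross-modal attention they receive from text tokens, keeping only the top fraction at each stage. The function names, the keep_ratio parameter, the attention-map shapes, and the averaging-based scoring rule are all illustrative assumptions, not the authors' implementation.

```python
import torch

def prune_by_self_attention(visual_tokens, self_attn, keep_ratio=0.5):
    # Early stage (assumed scoring rule): rank each visual token by the
    # attention it receives from other visual tokens, averaged over the
    # query axis, and keep the top keep_ratio fraction.
    scores = self_attn.mean(dim=0)                    # (N,) attention received
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep = scores.topk(k).indices.sort().values       # preserve token order
    return visual_tokens[keep]

def prune_by_cross_attention(visual_tokens, cross_attn, keep_ratio=0.5):
    # Later stage (assumed scoring rule): rank surviving visual tokens by
    # the cross-modal attention they receive from text tokens, discarding
    # the ones least relevant to the task.
    scores = cross_attn.mean(dim=0)                   # (N,) text-to-visual attention
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep = scores.topk(k).indices.sort().values
    return visual_tokens[keep]

# Toy usage with made-up shapes: 576 visual tokens, 32 text tokens.
visual = torch.randn(576, 1024)
self_attn = torch.rand(576, 576).softmax(dim=-1)      # visual self-attention map
visual = prune_by_self_attention(visual, self_attn, keep_ratio=0.5)    # 576 -> 288

cross_attn = torch.rand(32, visual.size(0)).softmax(dim=-1)            # text -> visual
visual = prune_by_cross_attention(visual, cross_attn, keep_ratio=0.5)  # 288 -> 144
```

Note that the two stages compound (here 0.5 x 0.5, so 25% of tokens survive); in a real LVLM the attention maps would be taken from specific early and later layers, which this sketch leaves unspecified.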
Similar Papers
LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models
Computation and Language
Makes smart AI see and think faster.
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
CV and Pattern Recognition
Makes AI see images faster and use less power.
Similarity-Aware Token Pruning: Your VLM but Faster
CV and Pattern Recognition
Makes AI see and understand faster.