Score: 1

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

Published: August 24, 2025 | arXiv ID: 2508.17243v2

By: Zicong Tang, Ziyang Ma, Suqing Wang, and more

Potential Business Impact:

Makes AI systems that understand images faster and cheaper to run.

Business Areas:
Image Recognition, Data and Analytics, Software

Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Due to the rich visual information, a single image can generate thousands of vision tokens, leading to high computational costs during the prefilling stage and significant memory overhead during decoding. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. However, these methods often struggle in shallow layers due to the lack of sufficient contextual information. We argue that many visual tokens are inherently redundant even in shallow layers and can be safely and effectively pruned with appropriate contextual signals. In this work, we propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM. The PPM is lightweight, model-agnostic, and operates independently of the LVLM architecture, ensuring seamless integration with various models. Extensive experiments on multiple benchmarks demonstrate that CoViPAL outperforms training-free pruning methods under equal token budgets and surpasses training-based methods with comparable supervision. CoViPAL offers a scalable and efficient solution to improve inference efficiency in LVLMs without compromising accuracy.

Country of Origin
🇨🇳 China

Repos / Data Links

Page Count
14 pages

Category
Computer Science:
CV and Pattern Recognition