Efficient Vision-Language Reasoning via Adaptive Token Pruning
By: Xue Li, Xiaonan Song, Henry Hu
Potential Business Impact:
Makes AI understand pictures faster and cheaper.
Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures process all tokens uniformly regardless of their informativeness. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning each visual token a hybrid importance score that combines ViT CLS attention (intra-modal saliency) with CLIP text-image similarity (inter-modal relevance), and forwarding only the top-K tokens to the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones such as BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under input corruptions; observations suggest that adaptive pruning suppresses spurious correlations and improves stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.
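The abstract describes a hybrid scoring rule applied at the vision-language interface. Below is a minimal, hypothetical PyTorch sketch of how such a score could be computed and used for top-K selection; the function name `atp_prune`, the mixing weight `alpha`, and the min-max normalization are illustrative assumptions, not the authors' released implementation.

```python
import torch

def atp_prune(vision_tokens, cls_attn, text_sim, top_k=64, alpha=0.5):
    """Sketch of ATP-style hybrid scoring and top-K token selection.

    vision_tokens: (N, D) visual token embeddings from the ViT
    cls_attn:      (N,)  ViT CLS-attention weights (intra-modal saliency)
    text_sim:      (N,)  CLIP text-image similarity per token (inter-modal relevance)
    alpha:         assumed mixing weight between the two signals
    """
    # Normalize both signals so the hybrid score is scale-comparable.
    cls_attn = (cls_attn - cls_attn.min()) / (cls_attn.max() - cls_attn.min() + 1e-6)
    text_sim = (text_sim - text_sim.min()) / (text_sim.max() - text_sim.min() + 1e-6)

    # Hybrid importance score combining intra- and inter-modal relevance.
    score = alpha * cls_attn + (1.0 - alpha) * text_sim

    # Keep only the top-K most informative tokens for the LLM.
    keep = torch.topk(score, k=min(top_k, score.numel())).indices
    return vision_tokens[keep], keep


# Usage with stand-in tensors (196 patch tokens, 768-dim embeddings).
tokens = torch.randn(196, 768)
pruned, kept_idx = atp_prune(tokens, torch.rand(196), torch.rand(196), top_k=64)
print(pruned.shape)  # torch.Size([64, 768])
```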
Similar Papers
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
CV and Pattern Recognition
Makes AI understand pictures and words faster.
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
CV and Pattern Recognition
Lets computers see smarter, using less data.
Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
CV and Pattern Recognition
Lets computers watch long videos faster.