Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
By: Jewon Lee, Ki-Ung Song, Seungmin Yang, and more
Potential Business Impact:
Makes AI see images faster and use less power.
Visual token reduction lowers inference costs caused by the large number of image features in large vision-language models (LVLMs). Unlike prior studies that prune tokens in self-attention-only LVLMs, our work addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache for image tokens in cross-attention layers is significantly larger than that for text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparsity of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama reduces KV cache demands without requiring additional training. With visual features reduced by 50%, the model lowers inference latency and memory usage while maintaining benchmark parity.
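To make the core idea concrete, below is a minimal sketch of attention-guided visual KV pruning, assuming PyTorch tensors. The function and parameter names (prune_visual_kv, keep_ratio) are illustrative assumptions, not the paper's actual implementation: visual tokens are ranked by the cross-attention mass they receive and only the top half of keys/values is retained.

```python
# Minimal sketch of cross-attention-based visual feature trimming.
# Assumes PyTorch; shapes and names are illustrative, not the paper's code.
import torch

def prune_visual_kv(attn_weights: torch.Tensor,
                    visual_keys: torch.Tensor,
                    visual_values: torch.Tensor,
                    keep_ratio: float = 0.5):
    """Keep the visual tokens that receive the most cross-attention.

    attn_weights: (batch, heads, num_text_queries, num_visual_tokens)
    visual_keys / visual_values: (batch, heads, num_visual_tokens, head_dim)
    """
    # Aggregate attention mass per visual token across heads and text queries.
    importance = attn_weights.mean(dim=(1, 2))        # (batch, num_visual_tokens)

    num_visual = importance.shape[-1]
    k = max(1, int(num_visual * keep_ratio))
    keep_idx = importance.topk(k, dim=-1).indices     # (batch, k)

    # Gather the retained keys/values so the KV cache shrinks by ~keep_ratio.
    idx = keep_idx[:, None, :, None].expand(
        -1, visual_keys.shape[1], -1, visual_keys.shape[-1])
    pruned_keys = torch.gather(visual_keys, 2, idx)   # (batch, heads, k, head_dim)
    pruned_values = torch.gather(visual_values, 2, idx)
    return pruned_keys, pruned_values, keep_idx


# Toy example: 1 sample, 8 heads, 16 text queries, 1024 visual tokens.
if __name__ == "__main__":
    attn = torch.softmax(torch.randn(1, 8, 16, 1024), dim=-1)
    keys = torch.randn(1, 8, 1024, 64)
    values = torch.randn(1, 8, 1024, 64)
    k, v, idx = prune_visual_kv(attn, keys, values, keep_ratio=0.5)
    print(k.shape, v.shape)  # torch.Size([1, 8, 512, 64]) for both
```

Because the pruning is driven only by attention scores already computed in the cross-attention layers, a scheme like this requires no extra training, which is consistent with the training-free claim above.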
Similar Papers
LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models
Computation and Language
Makes smart AI see and think faster.
AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
CV and Pattern Recognition
Makes smart computers understand pictures faster.
STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
Machine Learning (CS)
Speeds up AI that understands pictures and words.