Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
By: Lei Lei, Jie Gu, Xiaokang Ma, and more
Potential Business Impact:
Makes AI understand pictures faster and cheaper.
Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational cost and inefficiency. Previous works generally assume that all visual tokens are necessary in the shallow layers of LLMs, so token compression typically occurs in intermediate layers. In contrast, our study reveals an interesting insight: with proper selection, token compression is feasible at the input stage of the LLM with negligible performance loss. Specifically, we show that explainability methods can effectively evaluate the importance of each visual token with respect to the given instruction, which provides reliable guidance for token compression. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass and facilitating practical deployment. Interestingly, this mapping can be learned with a simple and lightweight convolutional network, whose training is efficient and independent of the MLLM. Extensive experiments on 10 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the effectiveness of our approach: for example, pruning 50% of visual tokens retains more than 96% of the original performance across all benchmarks for all three MLLMs. The method also generalizes well, even when the number of tokens at inference far exceeds that used in training.
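The sketch below illustrates, under stated assumptions, the kind of pipeline the abstract describes: a lightweight convolutional network predicts a per-visual-token importance score from the first LLM layer's attention map, and only the top-ranked tokens are kept before the rest of the forward pass. This is not the authors' released code; all names and shapes (ImportancePredictor, keep_ratio, the way attention is pooled into a per-token signal) are illustrative assumptions.

```python
# Minimal sketch of attention-guided visual token compression.
# Assumptions: the first-layer attention has already been reduced to a
# (batch, num_heads, num_visual_tokens) tensor, e.g. instruction-to-visual
# attention averaged over query positions; the predictor is trained separately
# to regress explanation-based importance scores (training loop not shown).
import torch
import torch.nn as nn


class ImportancePredictor(nn.Module):
    """Lightweight conv net mapping a first-layer attention map to token importance."""

    def __init__(self, num_heads: int, hidden: int = 32):
        super().__init__()
        # Treat attention heads as input channels of a small 1D conv net.
        self.net = nn.Sequential(
            nn.Conv1d(num_heads, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, num_heads, num_visual_tokens)
        return self.net(attn).squeeze(1)  # (batch, num_visual_tokens)


def compress_visual_tokens(visual_tokens, attn, predictor, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of visual tokens by predicted importance."""
    scores = predictor(attn)                                    # (B, N)
    k = max(1, int(scores.shape[-1] * keep_ratio))
    topk = scores.topk(k, dim=-1).indices.sort(dim=-1).values   # keep original order
    batch_idx = torch.arange(visual_tokens.size(0)).unsqueeze(-1)
    return visual_tokens[batch_idx, topk]                       # (B, k, D)


if __name__ == "__main__":
    B, H, N, D = 2, 16, 576, 1024            # toy sizes for illustration
    predictor = ImportancePredictor(num_heads=H)
    tokens = torch.randn(B, N, D)
    attn = torch.rand(B, H, N)
    kept = compress_visual_tokens(tokens, attn, predictor, keep_ratio=0.5)
    print(kept.shape)                         # torch.Size([2, 288, 1024])
```

Because the predictor only consumes a first-layer attention map, it can in principle be trained offline against explanation-derived importance targets and then plugged in at inference without a full forward pass, which is the deployment advantage the abstract emphasizes.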
Similar Papers
Token Sequence Compression for Efficient Multimodal Computing
CV and Pattern Recognition
Makes AI understand pictures and words faster.
Towards Adaptive Visual Token Pruning for Large Multimodal Models
CV and Pattern Recognition
Makes AI understand pictures faster and cheaper.
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
CV and Pattern Recognition
Makes AI understand pictures faster with fewer details.