TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts
By: Hao Zhang, Mengsi Lyu, Bo Huang, and more
Potential Business Impact:
Makes AI understand many pictures faster.
Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution, yet existing methods often overlook scenarios involving long-context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long-context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach maintains strong performance in long-context settings while significantly reducing the number of visual tokens.
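To make the two-stage pipeline concrete, here is a minimal sketch in Python. It is an assumption-laden illustration, not the paper's method: the function names are invented, intra-image diversity is approximated by mean pairwise cosine distance, the greedy step uses a farthest-point-style selection, and the Pareto procedure is replaced by a simple additive scalarization of the two scores.

```python
# Hypothetical sketch of the two-stage visual token pruning described above.
# All names and scoring formulas here are assumptions for illustration;
# the paper's exact budgeting and Pareto selection are not reproduced.
import numpy as np

def intra_image_budget(images, total_budget):
    """Allocate a content-aware token budget per image, proportional to
    intra-image diversity (assumed: mean pairwise cosine distance)."""
    diversities = []
    for tokens in images:  # tokens: (n_tokens, dim) visual tokens of one image
        t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        diversities.append(1.0 - (t @ t.T).mean())  # higher = more diverse
    d = np.array(diversities)
    weights = d / d.sum()
    return np.maximum(1, (weights * total_budget).astype(int))

def greedy_representative(tokens, k):
    """Greedily pick k tokens, each maximally dissimilar to those already
    chosen (a farthest-point-style stand-in for 'most representative')."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    chosen = [0]  # seed with the first token
    while len(chosen) < min(k, len(tokens)):
        sim_to_chosen = (t @ t[chosen].T).max(axis=1)
        sim_to_chosen[chosen] = np.inf  # never re-pick a chosen token
        chosen.append(int(sim_to_chosen.argmin()))
    return tokens[chosen]

def pareto_select(candidates, text_emb, k):
    """Keep k tokens from the global candidate pool, trading off diversity
    against text alignment (assumed: simple additive scalarization)."""
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    q = text_emb / np.linalg.norm(text_emb)
    align = c @ q                              # text-alignment score
    diversity = 1.0 - (c @ c.T).mean(axis=1)   # global diversity score
    return candidates[np.argsort(-(align + diversity))[:k]]

# Toy usage: two images' tokens, a text embedding, a total budget of 8.
rng = np.random.default_rng(0)
imgs = [rng.normal(size=(32, 16)), rng.normal(size=(48, 16))]
budgets = intra_image_budget(imgs, total_budget=8)
pool = np.vstack([greedy_representative(t, b) for t, b in zip(imgs, budgets)])
kept = pareto_select(pool, text_emb=rng.normal(size=16), k=6)
```

The split mirrors the abstract's structure: per-image budgets respond to how varied each image's content is, while the final global pass removes cross-image redundancy without discarding tokens relevant to the text query.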
Similar Papers
Towards Adaptive Visual Token Pruning for Large Multimodal Models
CV and Pattern Recognition
Makes AI understand pictures faster and cheaper.
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
CV and Pattern Recognition
Makes AI understand pictures and words faster.
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
CV and Pattern Recognition
Makes AI understand pictures faster and better.