Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

Published: December 1, 2025 | arXiv ID: 2512.01949v1

By: Zhongyu Yang, Dannong Xu, Wei Pang and more

Potential Business Impact:

Makes AI see pictures and videos faster.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning mitigates this by removing redundant tokens, but existing methods often ignore relevance to the user query or inherit the limitations of attention mechanisms, which reduces their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, the two modules improve performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy than existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8x prefill speedup and 10x FLOP reduction while retaining 96.88% of the original performance.
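The two-module idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify how the graph is built or how query relevance is scored, so the cosine-similarity graph, the greedy redundancy rule, the top-k ranking, and all function and parameter names below (`prune_redundant`, `select_query_relevant`, `two_stage_prune`, `sim_threshold`, `keep`) are assumptions chosen only to show the general shape of "remove redundant tokens, then keep query-relevant ones".

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_redundant(tokens, sim_threshold=0.95):
    """Assumed stand-in for graph-structured pruning.

    Tokens are treated as graph nodes; two tokens share an edge when their
    cosine similarity reaches `sim_threshold`. A token is kept only if it
    has no edge to a token that was already kept (greedy, in input order).
    Returns the indices of the kept tokens.
    """
    kept = []
    for i, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept


def select_query_relevant(tokens, candidates, query, keep):
    """Assumed stand-in for query-conditioned pruning.

    Ranks candidate token indices by cosine similarity to the query
    embedding and returns the top `keep` indices in their original order.
    """
    ranked = sorted(candidates, key=lambda i: cosine(tokens[i], query), reverse=True)
    return sorted(ranked[:keep])


def two_stage_prune(tokens, query, keep, sim_threshold=0.95):
    """Run redundancy removal first, then query-conditioned selection."""
    survivors = prune_redundant(tokens, sim_threshold)
    return select_query_relevant(tokens, survivors, query, keep)


if __name__ == "__main__":
    # Toy 2-D "visual token" embeddings; token 1 nearly duplicates token 0.
    visual_tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
    query_embedding = [0.0, 1.0]
    print(two_stage_prune(visual_tokens, query_embedding, keep=2))
```

In this toy input the near-duplicate token is dropped in the first stage, and the second stage keeps the two survivors closest to the query. Because both stages operate only on embeddings, a scheme of this kind needs no retraining, which matches the plug-and-play framing in the abstract.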

GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs

CV and Pattern Recognition

Makes AI understand pictures faster by focusing on important parts.

13 Nov 2025 0

89%

What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph

CV and Pattern Recognition

Makes AI understand pictures faster and better.

4 Jan 2025 1

89%

Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing

CV and Pattern Recognition

Lets computers watch long videos faster.

25 Aug 2025 1

Country of Origin

🇬🇧 United Kingdom

Repos / Data Links

github.com

Page Count

40 pages