LightVLM: Accelerating Large Multimodal Models with Pyramid Token Merging and KV Cache Compression

Published: August 30, 2025 | arXiv ID: 2509.00419v1

By: Lianyu Hu, Fanhua Shang, Wei Feng and more

Potential Business Impact:

Makes AI understand pictures much faster.

Business Areas:

Image Recognition Data and Analytics, Software

In this paper, we introduce LightVLM, a simple but effective method that can be seamlessly deployed on existing Vision-Language Models (VLMs) to greatly accelerate inference in a training-free manner. We divide VLM inference into two stages, encoding and decoding, and accelerate both to improve model efficiency. During encoding, we propose pyramid token merging, which reduces the number of tokens across LLM layers hierarchically so that only a few dominant tokens remain in the end. During decoding, to reduce the high latency of generating long sequences, we propose KV Cache compression, which removes unnecessary caches to increase network throughput. Experimental results show that LightVLM retains 100% of performance when preserving only 35% of image tokens, and around 98% of performance when keeping only 3% of image tokens. LightVLM can increase network throughput by 2.02$\times$ and reduce prefilling time by 3.65$\times$. LightVLM also enables a heavy model (e.g., InternVL2.5 26B) to run inference faster than significantly smaller models (e.g., InternVL2.5 8B), which may facilitate real-world deployment. When generating long text sequences (e.g., 4096 tokens), LightVLM can reduce inference time by 3.21$\times$, substantially outperforming existing methods.
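To make the idea of hierarchical token reduction concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how dominant tokens are chosen or how the rest are merged. The sketch assumes that each token carries an importance score (for example, attention received), that dropped tokens are averaged into their most similar kept token by cosine similarity, and that the per-stage keep ratios are free parameters. The names `merge_tokens`, `pyramid_merge`, and `keep_ratios` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens (assumed importance criterion).

    Each dropped token is averaged into the kept token it is most similar to,
    so its information is merged rather than discarded.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve the original token order
    sums = {i: list(tokens[i]) for i in kept_idx}
    counts = {i: 1 for i in kept_idx}
    for j in order[keep:]:
        target = max(kept_idx, key=lambda i: cosine(tokens[i], tokens[j]))
        sums[target] = [s + x for s, x in zip(sums[target], tokens[j])]
        counts[target] += 1
    merged = [[s / counts[i] for s in sums[i]] for i in kept_idx]
    return merged, [scores[i] for i in kept_idx]


def pyramid_merge(tokens, scores, keep_ratios):
    """Apply one merge step per stage; `keep_ratios` is a hypothetical schedule.

    Returns the final tokens and the token count after every stage.
    """
    counts = [len(tokens)]
    for ratio in keep_ratios:
        keep = max(1, round(len(tokens) * ratio))
        tokens, scores = merge_tokens(tokens, scores, keep)
        counts.append(len(tokens))
    return tokens, counts


# Toy usage: 8 two-dimensional "image tokens", halved at each of two stages.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
              [0.5, 0.5], [0.6, 0.4], [0.2, 0.8], [0.8, 0.2]]
toy_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.4, 0.6]
final_tokens, stage_counts = pyramid_merge(toy_tokens, toy_scores, [0.5, 0.5])
print(stage_counts)
```

In a real model each stage would sit at a chosen LLM layer, so later layers process progressively fewer image tokens. The KV Cache compression used during decoding is a separate step and is not covered by this sketch.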

LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

CV and Pattern Recognition

Makes AI understand pictures and words better, faster.

13 Aug 2025 1

89%

Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

CV and Pattern Recognition

Makes computers understand pictures much faster.

8 Aug 2025 0

89%

Country of Origin

🇨🇳 China

Page Count

14 pages
