
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Published: October 20, 2025 | arXiv ID: 2510.17777v1

By: Samir Khaki, Junxian Guo, Jiaming Tang and more

Potential Business Impact:

Makes AI understand pictures and words much faster.

Business Areas:

Visual Search, Internet Services

Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0× faster prefilling, 2.5× faster decoding, and an overall 2.6× end-to-end speedup on long-context video tasks, while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.
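
The two-stage idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the salience scores, the dot-product relevance measure, the 2-d "embeddings", and the function names `prune_prefill` and `retrieve_for_query` are all assumptions made for illustration. It only shows the shape of the decoupling: a light, query-agnostic prune at prefill that keeps most of the visual cache, followed by a query-aware selection from that cache at each conversation round.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a stand-in relevance score (assumption)."""
    return sum(x * y for x, y in zip(a, b))


def prune_prefill(salience: Sequence[float], keep_ratio: float) -> List[int]:
    """Query-agnostic step: keep the most salient visual tokens.

    Returns the indices of kept tokens in their original order.
    """
    k = max(1, round(len(salience) * keep_ratio))
    top = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)[:k]
    return sorted(top)


def retrieve_for_query(cache: Sequence[Sequence[float]], kept: Sequence[int],
                       query: Sequence[float], top_k: int) -> List[int]:
    """Query-aware step: from the retained cache, pick the tokens most
    relevant to the current query. The cache itself is not modified, so a
    later round can select different tokens.
    """
    ranked = sorted(kept, key=lambda i: dot(cache[i], query), reverse=True)[:top_k]
    return sorted(ranked)


# Hypothetical visual-token embeddings and salience scores (made-up values).
visual_cache = [
    [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
    [0.5, 0.5], [0.2, 0.2], [0.8, 0.3], [0.3, 0.8],
]
salience = [0.9, 0.2, 0.8, 0.7, 0.6, 0.1, 0.5, 0.4]

# Prefill: drop only the least salient quarter of the tokens.
kept = prune_prefill(salience, keep_ratio=0.75)

# Decoding: each conversation round retrieves its own subset from the same cache.
round1 = retrieve_for_query(visual_cache, kept, query=[1.0, 0.0], top_k=2)
round2 = retrieve_for_query(visual_cache, kept, query=[0.0, 1.0], top_k=2)

print(kept, round1, round2)
```

Because the prefill step discards only a small share of tokens, the second query can still reach tokens that were irrelevant to the first, which is the multi-turn property the abstract attributes to retaining most of the visual cache.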

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

CV and Pattern Recognition

Makes AI understand long videos much faster.

22 Apr 2025 1

89%

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

CV and Pattern Recognition

Lets AI remember long videos and stories.

9 Dec 2025 0

89%

AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

CV and Pattern Recognition

Makes AI understand pictures much faster.

8 Aug 2025 1


Page Count

18 pages
