VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking
By: Kichang Yang, Seonjun Kim, Minjae Kim, and more
Potential Business Impact:
Lets large vision-language AI models run faster on small devices by cutting how much data they read from storage.
Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.
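To make the utility idea in the abstract concrete, here is a minimal, hypothetical sketch of chunk-level, I/O-aware selection. It is not the paper's exact formulation: it assumes neuron importance is measured by activation magnitude, a fixed chunk size, and a toy latency model in which a chunk that directly follows an already-selected chunk skips the seek cost. All constants and function names (CHUNK_SIZE, SEEK_COST, chunk_latency, select_chunks) are illustrative assumptions.

```python
import numpy as np

CHUNK_SIZE = 32        # contiguous neurons per chunk (assumed)
SEEK_COST = 8.0        # cost of a non-contiguous flash access (assumed units)
TRANSFER_COST = 1.0    # cost per neuron read (assumed units)


def chunk_latency(idx, selected):
    """Marginal I/O cost of reading chunk `idx` given already-selected chunks.
    A chunk adjacent to a previously selected one avoids the seek cost."""
    seek = 0.0 if (idx - 1) in selected else SEEK_COST
    return seek + TRANSFER_COST * CHUNK_SIZE


def select_chunks(activations, io_budget):
    """Greedy selection by utility = chunk importance / estimated marginal latency."""
    importance = np.abs(activations)
    n_chunks = len(importance) // CHUNK_SIZE
    chunk_importance = (
        importance[: n_chunks * CHUNK_SIZE].reshape(n_chunks, CHUNK_SIZE).sum(axis=1)
    )

    selected, spent = set(), 0.0
    while True:
        best, best_utility, best_cost = None, 0.0, 0.0
        for idx in range(n_chunks):
            if idx in selected:
                continue
            cost = chunk_latency(idx, selected)
            utility = chunk_importance[idx] / cost
            if utility > best_utility and spent + cost <= io_budget:
                best, best_utility, best_cost = idx, utility, cost
        if best is None:
            break
        selected.add(best)
        spent += best_cost
    return sorted(selected)  # sorted order favors sequential flash reads


# Example: sparsify one 4096-neuron FFN layer under a fixed I/O latency budget.
acts = np.random.randn(4096)
print(select_chunks(acts, io_budget=1000.0))
```

Compared with magnitude-only top-k selection, this sketch illustrates how normalizing importance by access cost can favor contiguous chunks that amortize seeks, which is the behavior the paper attributes to its latency-aware utility.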
Similar Papers
Resting Neurons, Active Insights: Improving Input Sparsification for Large Language Models
Machine Learning (CS)
Makes big computer brains work better and faster.
LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation
CV and Pattern Recognition
Makes AI understand pictures faster, using less power.
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
CV and Pattern Recognition
Makes smart AI work on your phone.