Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference
By: Putu Indah Githa Cahyani, Komang David Dananjaya Suartana, Novanto Yudistira
Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method integrates with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving the deployment efficiency of vision-language models. To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, adding the files that implement the proposed method; it is available at https://github.com/kmdavidds/mlfastlm.
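The abstract describes the pipeline only at a high level. As a minimal sketch of what the three stages (content-aware analysis, adaptive resolution selection, content-aware cropping) could look like in front of an unmodified vision encoder, consider the Python snippet below. The edge-density complexity score, the resolution tiers and thresholds, and every function and file name here are illustrative assumptions, not the authors' exact implementation; the linked repository contains the real one.

import numpy as np
from PIL import Image, ImageFilter, ImageOps

# Hypothetical resolution tiers: visually simpler images get smaller
# inputs, which yields fewer visual tokens in the vision encoder.
# Each entry is (long-side target in pixels, max complexity score).
RESOLUTION_TIERS = [(512, 0.02), (768, 0.06), (1024, float("inf"))]

def complexity_score(img: Image.Image) -> float:
    """Fraction of edge pixels in a downscaled grayscale copy,
    used as a cheap proxy for visual complexity."""
    gray = ImageOps.grayscale(img).resize((256, 256))
    edges = np.asarray(gray.filter(ImageFilter.FIND_EDGES), dtype=np.float32)
    return float((edges > 32).mean())

def content_crop(img: Image.Image, pad: int = 8) -> Image.Image:
    """Crop to the bounding box of non-background content, with a margin.
    Inverting first makes near-white document borders count as background."""
    bbox = ImageOps.invert(ImageOps.grayscale(img)).getbbox()
    if bbox is None:  # blank image: nothing to crop
        return img
    left, top, right, bottom = bbox
    return img.crop((max(0, left - pad), max(0, top - pad),
                     min(img.width, right + pad), min(img.height, bottom + pad)))

def adaptive_preprocess(img: Image.Image) -> Image.Image:
    """Crop away empty margins, then downscale so the long side
    matches the tier selected by the complexity score."""
    img = content_crop(img)
    score = complexity_score(img)
    target = next(size for size, thr in RESOLUTION_TIERS if score <= thr)
    scale = target / max(img.size)
    if scale < 1.0:  # only downscale; never upsample simple images
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.BICUBIC)
    return img  # hand off to the unmodified FastVLM vision tower

if __name__ == "__main__":
    # "page.png" is a placeholder for a DocVQA-style document image.
    out = adaptive_preprocess(Image.open("page.png").convert("RGB"))
    print(out.size)

Because the adaptation happens entirely before tokenization, a sketch like this leaves the model weights and architecture untouched, which is what makes the approach retraining-free.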