LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
By: Jin Huang, Yuchao Jin, Le An, and more
Potential Business Impact:
Makes robots and cars understand the world faster.
This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce the input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for an autonomous driving application, the pipeline achieves a $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate that the pipeline is a viable solution for enabling real-time VLM deployment in resource-constrained environments.
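The abstract names three levers: patch selection, token selection, and speculative decoding. The sketch below is one way those stages could compose; the function names, the cosine-similarity scoring rules, and the greedy draft-then-verify loop are illustrative assumptions, not the authors' published implementation.

```python
# Illustrative sketch only: patch selection, token selection, and greedy
# speculative decoding. Names and scoring rules are assumptions, not the
# paper's actual API.
import torch
import torch.nn.functional as F


def select_patches(view_embeddings: torch.Tensor, query: torch.Tensor, keep_views: int) -> torch.Tensor:
    """Drop camera views whose pooled embedding is least similar to the prompt."""
    # view_embeddings: [num_views, num_patches, dim]; query: [dim]
    pooled = view_embeddings.mean(dim=1)                              # [num_views, dim]
    scores = F.cosine_similarity(pooled, query.unsqueeze(0), dim=-1)  # [num_views]
    keep = scores.topk(keep_views).indices
    return view_embeddings[keep]                                      # [keep_views, num_patches, dim]


def select_tokens(patch_tokens: torch.Tensor, query: torch.Tensor, keep_tokens: int) -> torch.Tensor:
    """Prune visual tokens so the LLM sees a shorter input sequence."""
    tokens = patch_tokens.flatten(0, 1)                               # [views * patches, dim]
    scores = tokens @ query                                           # relevance to the prompt
    keep = scores.topk(keep_tokens).indices
    return tokens[keep]                                               # [keep_tokens, dim]


def speculative_decode(target, draft, ids, max_new_tokens, k=4):
    """Greedy speculative decoding: a small draft model proposes k tokens,
    the large target model verifies them and keeps the agreeing prefix.
    `target` and `draft` are callables mapping a token list to the next token;
    in a real deployment the verification is a single batched forward pass."""
    ids = list(ids)
    start = len(ids)
    while len(ids) - start < max_new_tokens:
        proposal = list(ids)
        for _ in range(k):                       # cheap: draft model runs autoregressively
            proposal.append(draft(proposal))
        accepted = 0
        for i in range(k):                       # target model checks each proposed token
            if target(proposal[: len(ids) + i]) == proposal[len(ids) + i]:
                accepted += 1
            else:
                break
        ids = proposal[: len(ids) + accepted]
        if accepted < k:                         # on mismatch, take the target's own token
            ids.append(target(ids))
    return ids[: start + max_new_tokens]
```

Each stage trims work independently under this sketch: fewer views shrink the vision encoder's output, fewer visual tokens shorten the LLM prefill, and speculative decoding amortizes the target model over several generated tokens per verification step.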
Similar Papers
A Survey on Efficient Vision-Language Models
CV and Pattern Recognition
Makes smart AI work on small, slow devices.
Towards Fast, Memory-based and Data-Efficient Vision-Language Policy
CV and Pattern Recognition
Robots learn tasks faster and remember more.
SmolVLM: Redefining small and efficient multimodal models
Artificial Intelligence
Makes smart AI work on phones, not just big computers.