Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals
By: Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, and more
Potential Business Impact:
Helps blind people understand videos better.
Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor) and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework, which evaluates spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, which focuses on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
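To make the FP32 vs. INT8 comparison concrete, here is a minimal sketch of how a SmolVLM2-style model could be loaded at both precisions and prompted with an accessibility-oriented description request using Hugging Face transformers and bitsandbytes. The checkpoint name, prompt wording, and frame file are illustrative assumptions, not the paper's exact setup or deployment pipeline.

```python
# Sketch: compare FP32 and INT8 inference for a SmolVLM2-style VLM.
# Assumptions: the model ID, prompt text, and "frame.jpg" are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

MODEL_ID = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # assumed checkpoint name

def load_model(int8: bool):
    # INT8 via bitsandbytes; FP32 is the default full-precision load.
    quant = BitsAndBytesConfig(load_in_8bit=True) if int8 else None
    return AutoModelForVision2Seq.from_pretrained(
        MODEL_ID,
        quantization_config=quant,
        device_map="auto",
    )

processor = AutoProcessor.from_pretrained(MODEL_ID)
image = Image.open("frame.jpg")  # e.g., a sampled video frame

# One possible BLV-oriented prompt design (spatial + social + action cues).
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Describe the spatial layout, people, and ongoing actions "
                 "for a blind or low-vision user."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

for int8 in (False, True):
    model = load_model(int8)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    label = "INT8" if int8 else "FP32"
    print(label, processor.decode(out[0], skip_special_tokens=True))
```

On-device deployment (as studied in the paper) would typically go through a mobile runtime rather than Python, but the same precision trade-off applies: INT8 reduces memory and latency at a possible cost in description detail.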
Similar Papers
Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations
CV and Pattern Recognition
Helps blind people get answers they need faster.
"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with VLMs
Human-Computer Interaction
Helps blind people understand products better.
"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with VLMs
Human-Computer Interaction
Helps blind people understand products better.