Fine-Tuning Vision-Language Models for Visual Navigation Assistance
By: Xiao Li, Bharat Gandhi, Ming Zhan, and more
Potential Business Impact:
Helps blind people navigate indoors with voice.
We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors because precise location data, such as GPS, is unavailable. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low-Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We also propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential terms, providing a more comprehensive measure of navigational performance. After LoRA fine-tuning, the model generated markedly better directional instructions, overcoming limitations of the original BLIP-2 model.
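The abstract names the core recipe (BLIP-2 plus LoRA adapters), but not its exact configuration. The sketch below shows what such a setup typically looks like with Hugging Face Transformers and PEFT; the checkpoint, target modules, rank, learning rate, and training-step helper are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch: LoRA fine-tuning of BLIP-2 with Transformers + PEFT.
# Hyperparameters and target modules are assumptions for illustration.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Wrap the language model's attention projections with low-rank adapters;
# only these small matrices are trained, the base weights stay frozen.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of weights

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

def training_step(image, instruction_text):
    """One hypothetical step on an (indoor image, navigation instruction) pair."""
    inputs = processor(images=image, text=instruction_text,
                       return_tensors="pt").to("cuda", torch.float16)
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```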
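The proposed metric is described only as a refinement of BERT F1 that emphasizes directional and sequential terms. One plausible reading, sketched below under that assumption, blends the standard BERTScore F1 with an order-sensitive overlap over a small directional/sequential vocabulary; the word lists, the LCS-based overlap, and the 50/50 blend are illustrative choices, not the paper's formulation.

```python
# Hedged sketch of a direction-aware refinement of BERT F1.
# Vocabulary and weighting are assumptions for illustration.
from bert_score import score as bert_score

DIRECTIONAL = {"left", "right", "forward", "backward", "straight", "up", "down"}
SEQUENTIAL = {"first", "then", "next", "after", "before", "finally"}

def key_tokens(text):
    """Directional/sequential tokens of an instruction, in order of appearance."""
    return [t for t in text.lower().split() if t.strip(".,") in DIRECTIONAL | SEQUENTIAL]

def ordered_overlap(reference, candidate):
    """Fraction of the reference's key tokens matched by the candidate
    in the same relative order (longest common subsequence)."""
    ref, cand = key_tokens(reference), key_tokens(candidate)
    if not ref:
        return 1.0
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(ref)][len(cand)] / len(ref)

def navigation_score(references, candidates, weight=0.5):
    """Blend vanilla BERT F1 with the ordered directional overlap."""
    _, _, f1 = bert_score(candidates, references, lang="en", verbose=False)
    return [(1 - weight) * f + weight * ordered_overlap(ref, cand)
            for ref, cand, f in zip(references, candidates, f1.tolist())]

# A candidate that swaps "left" for "right" is penalized even though
# its plain BERT F1 against the reference remains high.
refs = ["Walk straight, then turn left at the elevator."]
cands = ["Walk straight, then turn right at the elevator."]
print(navigation_score(refs, cands))
```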
Similar Papers
Vision-Based Localization and LLM-based Navigation for Indoor Environments
Machine Learning (CS)
Guides you indoors using phone camera and AI.
Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation
Artificial Intelligence
Helps blind people navigate buildings using maps.
Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces
CV and Pattern Recognition
Robots follow spoken directions to find places.