VLMs Guided Interpretable Decision Making for Autonomous Driving
By: Xin Hu, Taotao Jing, Renran Tian, and more
Potential Business Impact:
Helps self-driving cars make safer, clearer choices.
Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
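The fusion step the abstract describes, combining visual features with the VLM-generated scene description, can be pictured as a cross-attention block followed by a decision head. Below is a minimal PyTorch sketch of that idea; the module name, dimensions, decision-class count, and pooling choice are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualLanguageFusion(nn.Module):
    """Hypothetical multi-modal fusion block: ego-view visual patch features
    attend to token embeddings of a VLM-generated scene description, and the
    fused representation feeds a high-level driving-decision head.
    All dimensions and the class count are illustrative assumptions."""

    def __init__(self, vis_dim=768, txt_dim=768, n_heads=8, n_decisions=4):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # align text width to visual width
        self.cross_attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)
        # e.g. go / yield / stop / turn -- a stand-in decision space
        self.decision_head = nn.Linear(vis_dim, n_decisions)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, vis_dim) ego-view patch features
        # txt_tokens: (B, Nt, txt_dim) embeddings of the enriched scene description
        txt = self.txt_proj(txt_tokens)
        fused, attn = self.cross_attn(query=vis_tokens, key=txt, value=txt)
        fused = self.norm(vis_tokens + fused)           # residual connection
        logits = self.decision_head(fused.mean(dim=1))  # pool over patches
        return logits, attn                             # attn hints at which phrases mattered

# Toy usage with random tensors standing in for real encoder outputs.
model = VisualLanguageFusion()
vis = torch.randn(2, 196, 768)   # e.g. 14x14 ViT patch features
txt = torch.randn(2, 32, 768)    # e.g. 32 description tokens
logits, attn = model(vis, txt)
print(logits.shape)              # torch.Size([2, 4])
```

The returned attention weights offer a crude view of which description tokens influenced the decision, in the spirit of the paper's interpretable textual explanations; the actual mechanism in the paper may differ.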
Similar Papers
dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning
CV and Pattern Recognition
Makes self-driving cars better at tricky situations.
V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars see in 3D.
ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models
Robotics
Robots learn to explore and do tasks better.