VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Published: November 9, 2025 | arXiv ID: 2511.06256v1

By: Ruifei Zhang, Wei Zhang, Xiao Tan and more

Potential Business Impact:

Makes self-driving cars see better and crash less.

Business Areas:

Autonomous Vehicles, Transportation

Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.
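
The 81% figure follows directly from the two parameter counts quoted in the abstract. As a quick sanity check, the short sketch below recomputes it. The function name `parameter_reduction` is hypothetical and used only for illustration; the inputs are the rounded 7B and 1.3B values from the abstract, not exact model sizes.

```python
def parameter_reduction(baseline_params: float, reduced_params: float) -> float:
    """Return the fractional reduction in parameter count."""
    if baseline_params <= 0:
        raise ValueError("baseline_params must be positive")
    return (baseline_params - reduced_params) / baseline_params


# Rounded parameter counts as quoted in the abstract (7B and 1.3B).
reduction = parameter_reduction(7.0e9, 1.3e9)
print(f"{reduction:.1%}")  # prints 81.4%
```

With the rounded counts the reduction comes to about 81.4%, consistent with the 81% reported.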

dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

CV and Pattern Recognition

Makes self-driving cars better at tricky situations.

4 Dec 2025

91%

V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars see in 3D.

30 Apr 2025

91%

Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

CV and Pattern Recognition

Helps cars watch drivers and roads for safety.

28 Nov 2025

Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

11 pages