
Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Published: December 3, 2025 | arXiv ID: 2512.03454v1

By: Haicheng Liao, Huanming Shen, Bonan Wang, and more

Potential Business Impact:

Helps self-driving cars understand spoken directions better.

Business Areas:

Autonomous Vehicles, Transportation

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions because they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset for AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long text, multiple agents, ambiguity) and retains superior performance even when trained on 50% of the data.
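The "encode, then roll out" idea described in the abstract can be illustrated with a toy sketch. This is not the authors' SA-WM: the function names (`encode_state`, `rollout`), the feature vectors, the random weights standing in for learned parameters, and the tanh-linear transition are all assumptions chosen only to show the general shape of distilling a scene plus command into one latent state and stepping it forward.

```python
import math
import random

def linear(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def make_weights(rows, cols, rng):
    """Small random matrix standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def encode_state(scene_feat, command_feat, w_enc):
    """Distill scene and command features into one command-aware latent state."""
    fused = scene_feat + command_feat  # list concatenation, not addition
    return [math.tanh(x) for x in linear(w_enc, fused)]

def rollout(z0, w_trans, horizon):
    """Step the latent state forward `horizon` times; returns future states only."""
    states, z = [], z0
    for _ in range(horizon):
        z = [math.tanh(x) for x in linear(w_trans, z)]
        states.append(z)
    return states

rng = random.Random(0)
scene_feat = [0.2, -0.1, 0.7, 0.4]  # hypothetical visual features
command_feat = [0.9, 0.3]           # hypothetical command embedding
latent_dim = 3

w_enc = make_weights(latent_dim, len(scene_feat) + len(command_feat), rng)
w_trans = make_weights(latent_dim, latent_dim, rng)

z0 = encode_state(scene_feat, command_feat, w_enc)
future_states = rollout(z0, w_trans, horizon=4)
```

In a full system along the lines the abstract describes, `future_states` would be passed together with the multimodal input to a decoder; here they are simply a list of four three-dimensional vectors.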

MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars make smarter, safer choices.

4 Dec 2025 1

90%

AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars think fast or slow.

17 Sep 2025 0

89%

ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars imagine and plan better.

15 Aug 2025 1


Page Count

19 pages
