SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
By: Jingyu Li, Junjie Wu, Dongnan Hu and more
Potential Business Impact:
Teaches self-driving cars to understand roads better.
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
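To make the scene-agent-goal ordering described in the abstract concrete, here is a minimal illustrative sketch in Python. It is not the authors' implementation: the class names (`SceneContext`, `Agent`, `Goal`, `DrivingState`), the `serialize` function, the text format, and the use of distance to the ego vehicle as a stand-in for "safety-critical" are all assumptions made for illustration. The abstract does not specify how SGDrive encodes any of these levels.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SceneContext:
    """Level 1 (hypothetical): overall environment description."""
    description: str


@dataclass
class Agent:
    """Level 2 (hypothetical): one surrounding agent in the ego frame."""
    kind: str
    position: Tuple[float, float]  # (x, y) in metres, ego vehicle at origin
    velocity: Tuple[float, float]  # (vx, vy) in metres per second


@dataclass
class Goal:
    """Level 3 (hypothetical): a short-term target waypoint."""
    waypoint: Tuple[float, float]  # (x, y) in metres, ego frame


@dataclass
class DrivingState:
    scene: SceneContext
    agents: List[Agent] = field(default_factory=list)
    goal: Optional[Goal] = None


def serialize(state: DrivingState) -> List[str]:
    """Flatten the hierarchy into lines ordered scene -> agents -> goal.

    Agents are sorted by distance to the ego vehicle, an assumed proxy
    for how safety-critical they are.
    """
    lines = [f"scene: {state.scene.description}"]
    for a in sorted(state.agents, key=lambda a: math.hypot(*a.position)):
        lines.append(
            f"agent: {a.kind} at ({a.position[0]:.1f}, {a.position[1]:.1f}) m, "
            f"velocity ({a.velocity[0]:.1f}, {a.velocity[1]:.1f}) m/s"
        )
    if state.goal is not None:
        gx, gy = state.goal.waypoint
        lines.append(f"goal: reach ({gx:.1f}, {gy:.1f}) m")
    return lines
```

Calling `serialize` on a `DrivingState` yields the scene line first, then the agents from nearest to farthest, then the goal, mirroring the perceive-attend-plan order the abstract attributes to human drivers. A learned system would presumably use feature tokens rather than text lines; the text form is used here only to keep the sketch readable.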
Similar Papers
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars understand where things are.
Spatial-aware Vision Language Model for Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars see in 3D.
Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
CV and Pattern Recognition
Helps self-driving cars understand spoken directions better.