DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
By: Yu Gao, Yiru Wang, Anqing Jiang, and more
Potential Business Impact:
Helps self-driving cars understand and drive safely.
Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but they often fail to generalize to long-tail scenarios because they lack the world knowledge needed to understand and reason about their surroundings. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work, we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module that directly generates semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. On the ICCV 2025 Autonomous Grand Challenge leaderboard, DiffVLA++ achieves an EPDMS of 49.12.
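As an illustration of the idea of metric-guided alignment, here is a minimal Python sketch of how a shared scorer might rank trajectory candidates produced by a VLA module and an E2E trajectory vocabulary. The metric terms, weights, timestep, and function names (`score_trajectory`, `align_and_select`) are assumptions for this sketch, not the paper's actual scorer.

```python
# Hypothetical sketch of metric-guided alignment between a VLA module and an
# E2E module. All metrics, weights, and names are illustrative assumptions;
# the paper's actual trajectory scorer may differ substantially.
import numpy as np

def score_trajectory(traj, speed_limit=15.0, w_comfort=1.0, w_progress=1.0, w_feasibility=2.0):
    """Score a candidate trajectory (T x 2 array of x, y waypoints) with simple
    surrogate metrics: forward progress, comfort (low acceleration), and
    physical feasibility (speed within a limit). Assumes a 0.5 s timestep."""
    dt = 0.5
    vel = np.diff(traj, axis=0) / dt                   # per-step velocity
    acc = np.diff(vel, axis=0) / dt                    # per-step acceleration
    speed = np.linalg.norm(vel, axis=1)
    progress = traj[-1, 0] - traj[0, 0]                # forward displacement
    comfort = -np.mean(np.linalg.norm(acc, axis=1))    # penalize harsh acceleration
    feasibility = -np.sum(np.maximum(speed - speed_limit, 0.0))
    return w_progress * progress + w_comfort * comfort + w_feasibility * feasibility

def align_and_select(vla_trajs, e2e_vocab):
    """Pool candidates from both modules and pick the one with the highest
    metric-guided score, so both sources are judged by the same criteria."""
    candidates = list(vla_trajs) + list(e2e_vocab)
    scores = [score_trajectory(t) for t in candidates]
    return candidates[int(np.argmax(scores))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy candidates: straight-line trajectories of 8 waypoints with small noise.
    base = np.stack([np.linspace(0, 20, 8), np.zeros(8)], axis=1)
    vla_trajs = [base + rng.normal(0, 0.2, base.shape) for _ in range(3)]
    e2e_vocab = [base * s for s in (0.5, 1.0, 1.5)]
    best = align_and_select(vla_trajs, e2e_vocab)
    print("Selected trajectory endpoint:", best[-1])
```

The key design point this sketch tries to capture is that a single scoring function, rather than either module alone, decides which trajectory is executed, which is how the two modules' complementary strengths can be combined.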
Similar Papers
Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving
Robotics
Teaches cars to drive safely by thinking.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
CV and Pattern Recognition
Helps self-driving cars plan safer, faster trips.