AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Published: June 16, 2025 | arXiv ID: 2506.13757v1

By: Zewei Zhou, Tianhui Cai, Seth Z. Zhao and more

Potential Business Impact:

Helps self-driving cars plan safer, faster trips.

Business Areas:

Autonomous Vehicles, Transportation

Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.
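As a rough illustration of the group-relative idea behind GRPO mentioned in the abstract, the sketch below normalizes the rewards of several sampled outputs for the same scene against their own group mean and standard deviation. It is a minimal sketch under stated assumptions, not AutoVLA's implementation: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all hypothetical, and the abstract does not specify the paper's actual reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per sampled output for a single
    prompt (here, hypothetically, one driving scene). Outputs that score
    above the group mean get a positive advantage, those below get a
    negative one. `eps` (an assumed constant) avoids division by zero
    when every sample receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled plans of the same scene.
example_rewards = [1.0, 0.0, 0.0, 1.0]
example_advantages = group_relative_advantages(example_rewards)
```

Because the baseline is the group mean rather than a learned value function, no separate critic model is needed. How a reward could also discourage unnecessary reasoning in simple scenarios, as the abstract describes, depends on details not given here.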

Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars drive smarter and faster.

25 Nov 2025 1

95%

LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction

CV and Pattern Recognition

Teaches cars to drive safely in any situation.

9 Jan 2026 1

94%

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Robotics

Teaches cars to drive by watching and understanding words.

18 Dec 2025 1

Country of Origin

🇺🇸 United States

Page Count

29 pages
