OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving
By: Pei Liu, Qingtian Ning, Xinyan Lu and more
Potential Business Impact:
Helps self-driving cars understand moving scenes.
Recent advances in vision-language models (VLMs) have demonstrated impressive spatial reasoning capabilities for autonomous driving, yet existing methods predominantly focus on static scene understanding while neglecting the essential temporal dimension of real-world driving scenarios. To address this critical limitation, we propose the OmniReason framework, which establishes robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and their underlying decision-making processes. Our work makes two fundamental advances: (1) We introduce OmniReason-Data, two large-scale vision-language-action (VLA) datasets with dense spatiotemporal annotations and natural language explanations, generated through a novel hallucination-mitigated auto-labeling pipeline that ensures both physical plausibility and temporal coherence; (2) We develop the OmniReason-Agent architecture, which integrates a sparse temporal memory module for persistent scene context modeling and an explanation generator that produces human-interpretable decision rationales, facilitated by our spatiotemporal knowledge distillation approach that effectively captures spatiotemporal causal reasoning patterns. Comprehensive experiments demonstrate state-of-the-art performance: OmniReason-Agent achieves significant improvements in both open-loop planning tasks and visual question answering (VQA) benchmarks, while establishing new capabilities for interpretable, temporally aware autonomous vehicles operating in complex, dynamic environments.
Similar Papers
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
CV and Pattern Recognition
Teaches computers to understand space like humans.
OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars understand scenes like people.
Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning
Artificial Intelligence
Helps AI think better for different tasks.