VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments

Published: August 7, 2025 | arXiv ID: 2508.05852v1

By: Kaiser Hamid, Khandakar Ashrafi Akbar, Nade Liang

Potential Business Impact:

Predicts where drivers look using words.

Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically from static RGB images of driving scenes. In this work, we propose a vision-language framework that models the changing landscape of drivers' gaze through natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few-shot and one-shot) and introduce domain-specific metrics for semantic alignment and response diversity. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to generate predictions of driver visual attention allocation and shifting in natural language, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human-AI teaming, and multi-agent coordination.
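
The abstract mentions a metric for response diversity but does not define it. As a rough illustration only, the sketch below computes distinct-n, a common generic diversity score (unique n-grams divided by total n-grams across a set of generated captions). This is an assumed stand-in, not the metric used in the paper; the function names `ngrams` and `distinct_n` and the whitespace tokenization are hypothetical choices made for this example.

```python
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(captions: Iterable[str], n: int = 2) -> float:
    """Distinct-n: unique n-grams / total n-grams over all captions.

    Assumption: lowercased whitespace tokenization. This is a generic
    diversity score, not the paper's domain-specific metric.
    """
    total = 0
    unique = set()
    for caption in captions:
        grams = ngrams(caption.lower().split(), n)
        total += len(grams)
        unique.update(grams)
    # An empty input (or captions shorter than n tokens) yields 0.0.
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical gaze descriptions, invented for this example.
    outputs = [
        "the driver looks at the pedestrian on the right",
        "the driver looks at the traffic light ahead",
    ]
    print(f"distinct-2 = {distinct_n(outputs, n=2):.3f}")
```

Values closer to 1 indicate that the generated descriptions rarely repeat phrasing, while values near 0 indicate that the model keeps producing the same wording across scenes.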

Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling

CV and Pattern Recognition

Helps cars spot distracted drivers better.

13 Jan 2026 1

91%

Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation

CV and Pattern Recognition

Lets cars describe what they see in words.

20 Jan 2026 1

91%

Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning

CV and Pattern Recognition

Helps cameras know where to look next.

5 Jan 2026 1

Country of Origin

🇺🇸 United States

Page Count

10 pages
