Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling

Published: January 13, 2026 | arXiv ID: 2601.08467v1

By: Takamichi Miyata, Sumiko Miyata, Andrew Morris

Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto the Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
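The two decoupling steps described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how the appearance embedding is extracted or removed, so the sketch assumes the embeddings are already computed by some VLM encoder and uses plain vector rejection as the removal rule. The function names (`remove_subject`, `orthogonalize_text`, `classify`) are hypothetical. The text-side step uses the standard closed form for the metric projection onto the Stiefel manifold: for a full-rank matrix with SVD `U S V^T`, the nearest matrix with orthonormal rows in Frobenius norm is the polar factor `U V^T`.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def remove_subject(image_emb, subject_emb):
    """Subject decoupling (ASSUMED rule: vector rejection).

    Subtracts the component of the image embedding that lies along the
    driver appearance embedding, then renormalizes the residual.
    """
    s = l2_normalize(subject_emb)
    residual = image_emb - np.dot(image_emb, s) * s
    return l2_normalize(residual)


def orthogonalize_text(text_embs):
    """Metric projection of a (K, D) matrix, K <= D, onto the Stiefel manifold.

    Returns the Frobenius-nearest matrix with orthonormal rows, i.e. the
    polar factor U @ Vt of the thin SVD. Assumes text_embs has full row rank.
    """
    u, _, vt = np.linalg.svd(text_embs, full_matrices=False)
    return u @ vt


def classify(image_emb, subject_emb, text_embs):
    """Zero-shot classification after both decoupling steps.

    Returns the index of the class prompt with the highest similarity.
    """
    z = remove_subject(image_emb, subject_emb)
    t = orthogonalize_text(text_embs)
    return int(np.argmax(t @ z))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_embs = l2_normalize(rng.normal(size=(3, 8)))  # 3 class prompts, D = 8
    subject_emb = rng.normal(size=8)                    # driver appearance
    image_emb = rng.normal(size=8)                      # full image
    print("predicted class index:", classify(image_emb, subject_emb, text_embs))
```

Because the polar factor is the closest orthonormal frame to the original prompt matrix, the class directions become mutually orthogonal while moving as little as possible, which matches the abstract's stated goal of improving separability while staying close to the original semantics.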

VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments

CV and Pattern Recognition

Predicts where drivers look using words.

7 Aug 2025

Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

CV and Pattern Recognition

Helps cars watch drivers and roads for safety.

28 Nov 2025

Spatial-aware Vision Language Model for Autonomous Driving

CV and Pattern Recognition

Helps self-driving cars see in 3D.

30 Dec 2025
