Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
By: Shiho Matta, Lis Kanashiro Pereira, Peitao Han, and more
Potential Business Impact:
Helps computers tell whether a video is playing forward or backward.
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary VLMs, both reasoning and non-reasoning, reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
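To make the task concrete, below is a minimal sketch of how an arrow-of-time trial could be scored: each clip is shown either forward or reversed, the model answers with a direction, and accuracy is compared against the 50% chance baseline. The frame representation and the `query_vlm` callable are illustrative assumptions for this sketch, not the benchmark's actual data format or API.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A "clip" is just an ordered sequence of frames; reversing the
# sequence reverses the arrow of time. Real frames would be image
# arrays; bytes is a stand-in to keep the sketch self-contained.
Frame = bytes
Clip = List[Frame]

def make_trial(clip: Clip) -> Tuple[Clip, str]:
    """Play the clip forward or backward at random; return it with its label."""
    if random.random() < 0.5:
        return clip, "forward"
    return list(reversed(clip)), "backward"

def evaluate_aot(clips: Sequence[Clip], query_vlm: Callable[[Clip], str]) -> float:
    """Binary forward/backward accuracy; chance level is 0.5."""
    correct = 0
    for clip in clips:
        stimulus, label = make_trial(clip)
        # query_vlm is a hypothetical model interface expected to
        # return the string "forward" or "backward".
        answer = query_vlm(stimulus)
        correct += int(answer.strip().lower() == label)
    return correct / len(clips)

if __name__ == "__main__":
    random.seed(0)
    dummy_clips = [[bytes([i]) for i in range(8)] for _ in range(100)]
    # A model that guesses at random should land near the 0.5 baseline,
    # which is roughly where the abstract reports most VLMs sit.
    guesser = lambda clip: random.choice(["forward", "backward"])
    print(f"accuracy: {evaluate_aot(dummy_clips, guesser):.2f}")
```

Scoring against the 0.5 chance baseline is what lets near-chance model accuracy be contrasted directly with the much higher human accuracy on the same stimuli.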
Similar Papers
Seeing the Arrow of Time in Large Multimodal Models
CV and Pattern Recognition
Teaches computers to understand video direction.
A Matter of Time: Revealing the Structure of Time in Vision-Language Models
CV and Pattern Recognition
Lets computers understand when pictures were taken.
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
CV and Pattern Recognition
Tests AI's grasp of video motion and addresses its gaps.