pith. machine review for the scientific record. sign in

arxiv: 2510.26241 · v5 · submitted 2025-10-30 · 💻 cs.CV · cs.CL

Recognition: unknown

Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models

Authors on Pith no claims yet
classification 💻 cs.CV cs.CL
keywords temporalvlmsmodelsaot-psyphybenchcausalevaluationhumansmultimodal
0
0 comments X
read the original abstract

Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...