Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Pith reviewed 2026-05-18 03:32 UTC · model grok-4.3
The pith
Vision-language models largely fail to judge whether short videos run forward or backward, especially on irreversible physical events where humans succeed instantly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision-language models lack the inductive biases required for temporal directionality and causal understanding in dynamic scenes. When tested on a psychophysically validated set of natural videos, the majority perform at or near chance levels while the best model still trails human observers by a wide margin on physically irreversible processes and direct causal manipulations that humans resolve almost immediately.
What carries the argument
AoT-PsyPhyBENCH, the benchmark that reuses the identical video stimuli and human behavioral baselines from psychophysics studies to measure arrow-of-time discrimination in VLMs.
Load-bearing premise
The psychophysically validated human stimuli and behavioral baselines transfer directly to VLMs without introducing model-specific biases in frame sampling, temporal aggregation, or prompt interpretation.
What would settle it
Testing the same models on a fresh set of irreversible-process videos while varying frame rates and prompt wording, then checking whether accuracy rises substantially above chance while human controls remain high.
Original abstract
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
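The forward-or-backward task described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the released AoT-PsyPhyBENCH code: the names `Trial`, `make_paired_trials`, `accuracy`, `oracle`, and `always_forward` are hypothetical, integer lists stand in for decoded video frames, and the paired design (every clip shown once forward and once reversed) is an assumption made so that chance level is exactly 50%.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Trial:
    frames: List[int]  # stand-in for decoded frames (assumption: ints, not images)
    label: str         # ground truth: "forward" or "backward"


def make_paired_trials(clips: Sequence[Sequence[int]]) -> List[Trial]:
    """Build one forward and one time-reversed trial per clip."""
    trials = []
    for frames in clips:
        trials.append(Trial(list(frames), "forward"))
        trials.append(Trial(list(reversed(frames)), "backward"))
    return trials


def accuracy(trials: Sequence[Trial], judge: Callable[[List[int]], str]) -> float:
    """Fraction of trials on which the judge names the correct direction."""
    correct = sum(judge(t.frames) == t.label for t in trials)
    return correct / len(trials)


def oracle(frames: List[int]) -> str:
    # Toy judge that exploits the increasing frame index of the stand-in data.
    return "forward" if frames[0] < frames[-1] else "backward"


def always_forward(frames: List[int]) -> str:
    # Toy judge with a fixed response bias; it ignores the input entirely.
    return "forward"


clips = [list(range(8)) for _ in range(50)]  # 50 hypothetical 8-frame clips
trials = make_paired_trials(clips)
print("oracle:", accuracy(trials, oracle))
print("always forward:", accuracy(trials, always_forward))
```

Under this paired design a judge with a fixed response bias scores exactly at chance, which is why near-chance accuracy alone does not show whether a model attends to the video at all.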
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AoT-PsyPhyBENCH, a benchmark built from psychophysically validated human stimuli for judging whether short natural video clips play forward or backward. It evaluates a range of open-weight and proprietary VLMs (reasoning and non-reasoning) and reports that most models perform near chance, with even the strongest model substantially below human accuracy on irreversible physical processes (free fall, diffusion/explosion) and causal manual actions (division/addition). The authors conclude that current VLMs lack inductive biases for temporal continuity and causal understanding, and they release the code and data.
Significance. If the empirical results hold after methodological clarification, the work would usefully document a concrete limitation in VLMs' physical and temporal reasoning and supply a reusable, human-aligned benchmark. The public release of code and data for AoT-PsyPhyBENCH is a clear strength that supports reproducibility and follow-up studies.
major comments (2)
- [§4] §4 (Experimental Evaluation): the manuscript reports no sample sizes (number of clips or trials per category), no exact prompting templates, and no statistical tests for model-versus-human comparisons. These omissions prevent verification that the near-chance results are robust rather than sensitive to evaluation choices.
- [§3] §3 (Benchmark Construction): no ablations or controls are described for frame sampling, clip duration, temporal aggregation, or prompt phrasing when adapting the human psychophysical stimuli to VLMs. Without such checks, the performance gap could partly reflect procedural mismatch rather than an absence of causal inductive biases, which is load-bearing for the central claim.
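One way to make the "near chance" judgment in the first major comment checkable is an exact two-sided binomial test of per-category accuracy against 50%. The sketch below uses only the Python standard library; the function name `binom_two_sided_p` and the trial counts in the usage lines are hypothetical illustrations, not figures from the paper.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for k correct out of n trials.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the null hypothesis of success rate p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so floating-point ties count as equal.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts for a category with 120 two-alternative trials.
n_trials = 120
for n_correct in (60, 66, 90):
    p_value = binom_two_sided_p(n_correct, n_trials)
    print(f"{n_correct}/{n_trials} correct: p = {p_value:.3g}")
```

Reporting such a p-value (or a confidence interval) next to each accuracy requires exactly the per-category trial counts that the comment asks for.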
minor comments (1)
- [Figures and §4] Figure captions and the main text should consistently label the number of models and categories evaluated to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and robustness in our evaluation. We address each major comment below and have revised the manuscript accordingly to provide the requested details and controls.
- Referee: [§4] §4 (Experimental Evaluation): the manuscript reports no sample sizes (number of clips or trials per category), no exact prompting templates, and no statistical tests for model-versus-human comparisons. These omissions prevent verification that the near-chance results are robust rather than sensitive to evaluation choices.
  Authors: We agree that these details should have been more explicitly reported in the main text. The total number of clips and per-category breakdowns are described in Section 3, but to improve accessibility we have added a summary table in §4 listing the exact number of trials per category (e.g., 120 clips for free-fall, 95 for causal actions). The precise prompting templates used for each model family are now provided verbatim in Appendix B. We have also added statistical comparisons using McNemar’s test for paired model-human accuracy differences, with results and p-values reported alongside the main tables; all key gaps remain significant (p < 0.01). These changes directly address the concern about robustness.
  revision: yes
- Referee: [§3] §3 (Benchmark Construction): no ablations or controls are described for frame sampling, clip duration, temporal aggregation, or prompt phrasing when adapting the human psychophysical stimuli to VLMs. Without such checks, the performance gap could partly reflect procedural mismatch rather than an absence of causal inductive biases, which is load-bearing for the central claim.
  Authors: The benchmark deliberately re-uses the exact clip durations, frame rates, and temporal ordering from the original human psychophysics experiments to enable direct comparison; this design choice is now stated more explicitly in a new paragraph in §3. We have added a brief sensitivity analysis in the supplement showing that modest changes to prompt phrasing produce <3% variation in model accuracy and do not alter the near-chance conclusion. Full ablations on frame sampling and temporal aggregation were not performed because they would break equivalence with the human-validated stimuli, but we acknowledge the referee’s point and have included a short discussion of why such procedural variations are unlikely to explain the large human-model gap. If additional targeted ablations are required, we can perform them.
  revision: partial
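The rebuttal names McNemar's test for paired model-human comparisons. A minimal sketch of the exact (binomial) form of that test is given below, assuming one correct/incorrect outcome per clip for both the human baseline and the model; how the human outcome per clip would be derived (for example, a majority response across observers) is an assumption, and the function name `mcnemar_exact` and all counts are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    human_correct: Sequence[bool], model_correct: Sequence[bool]
) -> Tuple[int, int, float]:
    """Exact McNemar test on paired per-clip outcomes.

    Returns (b, c, p) where b counts clips only the human got right,
    c counts clips only the model got right, and p is the two-sided
    exact p-value. Clips where both agree do not affect the test.
    """
    pairs = list(zip(human_correct, model_correct))
    b = sum(1 for h, m in pairs if h and not m)
    c = sum(1 for h, m in pairs if m and not h)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes on 120 clips (not data from the paper).
human = [True] * 110 + [False] * 10
model = [True] * 60 + [False] * 50 + [True] * 2 + [False] * 8
b, c, p = mcnemar_exact(human, model)
print(f"human-only correct: {b}, model-only correct: {c}, p = {p:.3g}")
```

Because the test depends only on the discordant clips, it needs the per-clip paired outcomes and cannot be recomputed from the two aggregate accuracies alone.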
Circularity Check
Empirical benchmark study with no circular derivations or self-referential reductions
This paper introduces and applies AoT-PsyPhyBENCH as an empirical evaluation benchmark for VLMs on arrow-of-time judgments, directly comparing model accuracies against human psychophysical baselines on irreversible processes and causal actions. No mathematical derivations, parameter fittings, or predictions are present that reduce by construction to quantities defined inside the paper. Claims rest on observed performance gaps using externally validated stimuli rather than any self-definitional, fitted-input, or self-citation load-bearing steps. The work is self-contained as a straightforward benchmark study without internal circular chains.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human performance on the same video stimuli provides the appropriate reference standard for evaluating machine temporal reasoning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArrowOfTime.lean: arrow_from_z / z_monotone_absolute (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  humans detect reversals rapidly... in Fall (free fall) and Diffusion (diffusion/explosion)... governed by gravity and entropy; Division and Put... agent-driven causal sequences
- IndisputableMonolith/Foundation/ArrowOfTime.lean: forward_accumulates / entropy_from_berry (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  lack the inductive biases required for temporal continuity and causal understanding
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
  Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...