
arxiv: 2510.26241 · v5 · submitted 2025-10-30 · 💻 cs.CV · cs.CL

Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models

Pith reviewed 2026-05-18 03:32 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision-language models · arrow of time · temporal reasoning · psychophysics · benchmark · causal understanding · physical irreversibility · video direction

The pith

Vision-language models largely fail to judge whether short videos run forward or backward, especially on irreversible physical events where humans succeed instantly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests vision-language models on the arrow of time task using the same short natural video clips and human performance baselines established in prior psychophysics work. Most models score near chance, and even the strongest ones fall well behind human accuracy on clips showing free fall, explosions, diffusion, or causal manual actions such as division and addition. The results indicate that current VLMs capture visual-semantic patterns yet lack the inductive biases needed for temporal continuity and physical causality. The authors release the benchmark, code, and data to support targeted improvements in these areas.

Core claim

Vision-language models lack the inductive biases required for temporal directionality and causal understanding in dynamic scenes. When tested on a psychophysically validated set of natural videos, the majority perform at or near chance levels while the best model still trails human observers by a wide margin on physically irreversible processes and direct causal manipulations that humans resolve almost immediately.

What carries the argument

AoT-PsyPhyBENCH, the benchmark that reuses the identical video stimuli and human behavioral baselines from psychophysics studies to measure arrow-of-time discrimination in VLMs.
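As a rough illustration of what arrow-of-time discrimination involves, the sketch below shows each clip in either forward or reversed frame order and scores a judge on how often it names the direction correctly. Everything here is assumed for illustration: the 50/50 random assignment, the string stand-ins for frames, and the `make_trial`, `accuracy`, and `oracle_judge` names are hypothetical and do not come from the AoT-PsyPhyBENCH code release.

```python
import random
from typing import Callable, Sequence

Frames = Sequence[str]  # stand-in for decoded video frames (assumption)


def make_trial(frames: Frames, rng: random.Random) -> tuple[list, str]:
    """Return the clip in forward or reversed frame order plus its true label."""
    if rng.random() < 0.5:
        return list(frames), "forward"
    return list(reversed(frames)), "backward"


def accuracy(clips: Sequence[Frames], judge: Callable[[list], str], seed: int = 0) -> float:
    """Fraction of trials on which the judge names the playback direction correctly."""
    rng = random.Random(seed)
    hits = 0
    for frames in clips:
        shown, label = make_trial(frames, rng)
        hits += judge(shown) == label
    return hits / len(clips)


def oracle_judge(shown: list) -> str:
    """Toy judge that reads the order off the frame names; a real VLM call would go here."""
    return "forward" if shown[0] < shown[-1] else "backward"


# Toy clips whose frame names encode their true temporal order.
toy_clips = [[f"f{i}" for i in range(4)] for _ in range(20)]
print(accuracy(toy_clips, oracle_judge))
```

A judge that always answers "forward" would land near 0.5 on such a balanced set, which is the chance level the near-chance results are measured against.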

Load-bearing premise

The psychophysically validated human stimuli and behavioral baselines transfer directly to VLMs without introducing model-specific biases in frame sampling, temporal aggregation, or prompt interpretation.

What would settle it

Testing the same models on a fresh set of irreversible-process videos while varying frame rates and prompt wording, then checking whether accuracy rises substantially above chance while human controls remain high.
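One way to check whether accuracy rises above chance under each frame-rate and prompt condition is an exact one-sided binomial test against a chance rate of 0.5. The sketch below is a minimal version of that check; the condition names and counts are hypothetical placeholders rather than results from the paper, and `binomial_p_above_chance` is an illustrative helper.

```python
from math import comb


def binomial_p_above_chance(correct: int, total: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= correct) under the chance rate."""
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical counts, for illustration only (not results from the paper).
conditions = {
    ("8 fps", "prompt A"): (52, 100),
    ("8 fps", "prompt B"): (49, 100),
    ("24 fps", "prompt A"): (71, 100),
}

for (fps, prompt), (correct, total) in conditions.items():
    p = binomial_p_above_chance(correct, total)
    print(f"{fps}, {prompt}: accuracy={correct / total:.2f}, p={p:.4f}")
```

With many conditions tested at once, a multiple-comparison correction would be a sensible addition before reading any single small p-value as evidence.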

Original abstract

Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AoT-PsyPhyBENCH, a benchmark built from psychophysically validated human stimuli for judging whether short natural video clips play forward or backward. It evaluates a range of open-weight and proprietary VLMs (reasoning and non-reasoning) and reports that most models perform near chance, with even the strongest model substantially below human accuracy on irreversible physical processes (free fall, diffusion/explosion) and causal manual actions (division/addition). The authors conclude that current VLMs lack inductive biases for temporal continuity and causal understanding, and they release the code and data.

Significance. If the empirical results hold after methodological clarification, the work would usefully document a concrete limitation in VLMs' physical and temporal reasoning and supply a reusable, human-aligned benchmark. The public release of code and data for AoT-PsyPhyBENCH is a clear strength that supports reproducibility and follow-up studies.

major comments (2)
  1. [§4] §4 (Experimental Evaluation): the manuscript reports no sample sizes (number of clips or trials per category), no exact prompting templates, and no statistical tests for model-versus-human comparisons. These omissions prevent verification that the near-chance results are robust rather than sensitive to evaluation choices.
  2. [§3] §3 (Benchmark Construction): no ablations or controls are described for frame sampling, clip duration, temporal aggregation, or prompt phrasing when adapting the human psychophysical stimuli to VLMs. Without such checks, the performance gap could partly reflect procedural mismatch rather than an absence of causal inductive biases, which is load-bearing for the central claim.
minor comments (1)
  1. [Figures and §4] Figure captions and the main text should consistently label the number of models and categories evaluated to improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of clarity and robustness in our evaluation. We address each major comment below and have revised the manuscript accordingly to provide the requested details and controls.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Evaluation): the manuscript reports no sample sizes (number of clips or trials per category), no exact prompting templates, and no statistical tests for model-versus-human comparisons. These omissions prevent verification that the near-chance results are robust rather than sensitive to evaluation choices.

    Authors: We agree that these details should have been more explicitly reported in the main text. The total number of clips and per-category breakdowns are described in Section 3, but to improve accessibility we have added a summary table in §4 listing the exact number of trials per category (e.g., 120 clips for free-fall, 95 for causal actions). The precise prompting templates used for each model family are now provided verbatim in Appendix B. We have also added statistical comparisons using McNemar’s test for paired model-human accuracy differences, with results and p-values reported alongside the main tables; all key gaps remain significant (p < 0.01). These changes directly address the concern about robustness.
    revision: yes

  2. Referee: [§3] §3 (Benchmark Construction): no ablations or controls are described for frame sampling, clip duration, temporal aggregation, or prompt phrasing when adapting the human psychophysical stimuli to VLMs. Without such checks, the performance gap could partly reflect procedural mismatch rather than an absence of causal inductive biases, which is load-bearing for the central claim.

    Authors: The benchmark deliberately re-uses the exact clip durations, frame rates, and temporal ordering from the original human psychophysics experiments to enable direct comparison; this design choice is now stated more explicitly in a new paragraph in §3. We have added a brief sensitivity analysis in the supplement showing that modest changes to prompt phrasing produce <3% variation in model accuracy and do not alter the near-chance conclusion. Full ablations on frame sampling and temporal aggregation were not performed because they would break equivalence with the human-validated stimuli, but we acknowledge the referee’s point and have included a short discussion of why such procedural variations are unlikely to explain the large human-model gap. If additional targeted ablations are required, we can perform them.
    revision: partial
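The paired model-human comparison named in the first response can be illustrated with an exact McNemar test, which looks only at clips where the model and the human reference disagree. The sketch below is a minimal stdlib version under stated assumptions: each clip is assumed to have a single binary human outcome (for example, a majority judgment), the toy data are invented, and `mcnemar_exact` is a hypothetical helper, not the analysis code behind the simulated rebuttal.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(model_correct: Sequence[bool], human_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes on the same clips."""
    if len(model_correct) != len(human_correct):
        raise ValueError("outcome lists must cover the same clips")
    # Discordant pairs: exactly one of the two raters is correct on the clip.
    b = sum(m and not h for m, h in zip(model_correct, human_correct))
    c = sum(h and not m for m, h in zip(model_correct, human_correct))
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    # Under the null, each discordant pair favors either side with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Toy data (assumption): 10 clips, human reference correct on all, model on 2.
human = [True] * 10
model = [True, True] + [False] * 8
print(mcnemar_exact(model, human))
```

Because the test ignores clips where both raters agree, a large overall accuracy gap with few discordant clips can still yield a weak p-value, which is why per-category clip counts matter for interpreting it.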

Circularity Check

0 steps flagged

Empirical benchmark study with no circular derivations or self-referential reductions

full rationale

This paper introduces and applies AoT-PsyPhyBENCH as an empirical evaluation benchmark for VLMs on arrow-of-time judgments, directly comparing model accuracies against human psychophysical baselines on irreversible processes and causal actions. No mathematical derivations, parameter fittings, or predictions are present that reduce by construction to quantities defined inside the paper. Claims rest on observed performance gaps using externally validated stimuli rather than any self-definitional, fitted-input, or self-citation load-bearing steps. The work is self-contained as a straightforward benchmark study without internal circular chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the transferability of human psychophysical stimuli and the interpretation that near-chance performance indicates missing inductive biases for temporal continuity.

axioms (1)
  • domain assumption: Human performance on the same video stimuli provides the appropriate reference standard for evaluating machine temporal reasoning.
    The benchmark is explicitly constructed around previously established human behavioral baselines.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...