YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Jie-Ying Lee; Kaipeng Zhang; You-Zhe Xie; Yu-Hsuan Li; Yu-Lun Liu; Zhixiang Wang

arxiv: 2605.30346 · v1 · pith:N5A4EID6new · submitted 2026-05-28 · 💻 cs.CV


Pith reviewed 2026-06-29 08:24 UTC · model grok-4.3


keywords: video diffusion models, causality, world models, benchmark, counterfactuals, arrow of time, violation of expectation


The pith

Video diffusion models notice when time runs backward but do not grasp cause and effect like humans do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether video diffusion models function as world models by checking if they understand causality or only statistical timing patterns. It introduces YoCausal, which reverses real videos at no cost to create natural counterfactual examples and measures both time-direction awareness and causal reasoning separately. On 13 current models, time-direction detection turns out to be unrelated to actual causal understanding, leaving a clear shortfall compared with human performance. This distinction matters because models positioned as simulators of the physical world need reliable cause-effect reasoning to predict what happens next in new situations.

Core claim

YoCausal is a two-level benchmark that first quantifies arrow-of-time perception through a Reverse Surprise Index based on denoising loss when videos are played backward, then applies a Causality Cognition Index that uses a vision-language model to split videos into causal and non-causal groups. Evaluation across 13 state-of-the-art video diffusion models shows that strong performance on the first index does not produce strong performance on the second, revealing that temporal pattern recognition alone does not deliver causal cognition and that current models remain far from human levels on real-world videos.
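A minimal sketch of how a denoising-loss surprise score could be computed, assuming a standard noise-prediction objective. The abstract does not give the RSI formula, so the normalized gap used here, the function names, and the `toy_predict_noise` stand-in for a video diffusion model are illustrative assumptions, not the paper's definitions. A clip is reduced to one scalar per frame to keep the example self-contained.

```python
import math
import random


def add_noise(clip, alpha_bar, rng):
    """Forward diffusion step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in clip]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(clip, eps)]
    return x_t, eps


def toy_predict_noise(x_t, alpha_bar):
    """Hypothetical stand-in for a VDM: it assumes the clean clip is a rising ramp."""
    n = len(x_t)  # needs at least two frames
    template = [i / (n - 1) for i in range(n)]
    return [(x - math.sqrt(alpha_bar) * m) / math.sqrt(1.0 - alpha_bar)
            for x, m in zip(x_t, template)]


def denoising_loss(clip, predict_noise, alpha_bar=0.5, draws=32, seed=0):
    """Mean squared error between predicted and true noise, averaged over draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        x_t, eps = add_noise(clip, alpha_bar, rng)
        eps_hat = predict_noise(x_t, alpha_bar)
        total += sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(clip)
    return total / draws


def reverse_surprise(clip, predict_noise):
    """Assumed form: normalized loss gap between reversed and forward playback."""
    fwd = denoising_loss(clip, predict_noise)
    rev = denoising_loss(clip[::-1], predict_noise)
    denom = fwd + rev
    return 0.0 if denom == 0.0 else (rev - fwd) / denom
```

Both passes reuse the same seed, so the forward and reversed clips see identical noise draws and the gap reflects playback direction rather than sampling noise. Under this assumed form, a score near 1 means the reversed clip is much harder to denoise, and a time-symmetric clip scores 0.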

What carries the argument

The argument rests on the YoCausal benchmark, which creates natural counterfactuals by temporally reversing real videos. It computes the Reverse Surprise Index to measure time-direction sensitivity and the Causality Cognition Index to isolate genuine causal reasoning from temporal bias.
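The stratification step can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say whether the Causality Cognition Index is a difference, a ratio, or some other statistic, so the subset-mean difference below, the function name `stratified_gap`, and the label strings `"causal"` and `"non_causal"` are all hypothetical.

```python
from statistics import mean


def stratified_gap(scores, labels):
    """Difference in mean per-video score between the two label groups.

    scores: one surprise score per video.
    labels: one label per video, "causal" or "non_causal" (assumed strings
            standing in for the VLM's stratification output).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    causal = [s for s, lab in zip(scores, labels) if lab == "causal"]
    non_causal = [s for s, lab in zip(scores, labels) if lab == "non_causal"]
    if not causal or not non_causal:
        raise ValueError("both subsets must be non-empty")
    return mean(causal) - mean(non_causal)
```

The design intent is that a model relying only on temporal statistics would be about equally surprised by reversal in both subsets, giving a gap near zero, while a model sensitive to cause and effect would show a larger gap on the causal subset.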

If this is right

Models can detect time reversal without acquiring causal reasoning.
Synthetic-data benchmarks may overlook real-world causal failures.
Current video diffusion models fall short of human causal cognition.
The two-level protocol can be extended to new models at low cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improving the Causality Cognition Index could lead models to generate more physically consistent future frames.
The same reversal technique might expose causal gaps in other generative domains such as audio or 3D scenes.
Explicit causal objectives beyond standard diffusion training may be needed to close the human gap.

Load-bearing premise

Reversing real-world videos produces valid natural counterfactual samples, and a vision-language model can accurately separate causal from non-causal videos.

What would settle it

A model that scores equally on causal and non-causal subsets in the Causality Cognition Index or reaches human-level scores on both indices would contradict the reported gap between time perception and causal understanding.

Original abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's time-reversal benchmark for separating temporal perception from causality in video models has a practical setup but rests on unvalidated assumptions about counterfactuals and VLM labels.


The main thing to know is that YoCausal applies denoising loss on reversed real videos to measure arrow-of-time sensitivity (RSI) and then uses a VLM to split videos into causal versus non-causal groups (CCI), claiming the two do not align in 13 tested models.
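One way a reader could check whether two indices "align" across models is a rank correlation over the per-model scores. The sketch below computes Kendall's tau-a in plain Python. The statistic the paper actually uses is not stated in this summary, so this is an assumption about how such a check might look.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Passing each model's RSI as `x` and its CCI as `y` gives a value in [-1, 1]. A value near 0 would be consistent with the claimed dissociation, while a value near 1 would mean the two indices rank the models almost identically.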

The new elements are the zero-cost reversal protocol on real footage and the two-level split that tries to move past synthetic data. Applying this to multiple diffusion models and linking it to the Violation of Expectation idea gives a concrete way to probe the issue.

The soft spots are in the core premises. Reversing videos breaks entropy, gravity, and other processes together, so the RSI may capture general physics violation detection rather than targeted causal reasoning. The CCI depends on an external VLM for stratification with no reported calibration to human judgments or formal causal criteria, which leaves room for the index to track VLM artifacts instead.

The abstract gives no dataset sizes, CCI computation details, or error bars, making the size of the claimed gap to humans hard to assess. If the full paper adds those controls and shows the dissociation survives them, the result would be sharper.

This is for groups building video models for robotics or simulation who need causality checks. Readers can take the protocol as a starting point but should add their own validation.

It deserves peer review so the authors can strengthen the counterfactual and labeling steps.

Referee Report

3 major / 1 minor

Summary. The paper introduces YoCausal, a two-level benchmark for evaluating causal understanding in video diffusion models (VDMs) inspired by the Violation of Expectation paradigm. It treats temporally reversed real-world videos as natural counterfactual samples, defines the Reverse Surprise Index (RSI) to quantify arrow-of-time perception via denoising loss, and the Causality Cognition Index (CCI) via VLM-based stratification into causal vs. non-causal subsets. Evaluation across 13 state-of-the-art VDMs concludes that arrow-of-time perception does not imply causal understanding and that a significant gap remains relative to human causal cognition.

Significance. If the premises hold, this provides a scalable, real-world, zero-cost protocol for disentangling temporal bias from causal reasoning in generative video models, extending cognitive science methods to assess progress toward world models. It offers falsifiable indices and highlights a dissociation that could guide future VDM development.

major comments (3)

[Abstract] Treating temporally reversed videos as 'natural counterfactual samples' is load-bearing for the central dissociation claim, yet reversal simultaneously violates multiple irreversible processes (entropy, gravity, friction) without corresponding to a targeted do-intervention or single-cause counterfactual in a causal graph; this risks conflating general physics-violation detection with causal reasoning.
[Abstract] CCI relies on an off-the-shelf VLM to partition videos into causal vs. non-causal subsets with no reported calibration against human judgments or formal causal criteria; without this, the reported gap between RSI and CCI may reflect VLM annotation artifacts rather than VDM causal understanding.
[Abstract] The evaluation on 13 VDMs reports a dissociation and gap to humans but supplies no quantitative details on CCI computation, error bars, dataset sizes, or controls for VLM bias, preventing verification that the data support the central claim.
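For the error bars requested above, a common minimal choice is the standard error of the mean over repeated runs. The sketch below assumes the index is recomputed several times, for example with different noise seeds or repeated VLM labeling passes. The function name and the run-level setup are illustrative and not taken from the paper.

```python
import math
from statistics import mean, stdev


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for repeated measurements."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    # stdev uses the sample (n - 1) estimator.
    return mean(values), stdev(values) / math.sqrt(len(values))
```

Reporting the pair for each model and index would let a reader judge whether a difference between two models, or between a model and the human baseline, exceeds run-to-run variation.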

minor comments (1)

The abstract would benefit from explicit references to causal inference literature (e.g., Pearl's do-calculus) and prior VoE implementations to situate the protocol.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below with clarifications on our methodology and indicate planned revisions where appropriate.


Referee: Treating temporally reversed videos as 'natural counterfactual samples' is load-bearing for the central dissociation claim, yet reversal simultaneously violates multiple irreversible processes (entropy, gravity, friction) without corresponding to a targeted do-intervention or single-cause counterfactual in a causal graph; this risks conflating general physics-violation detection with causal reasoning.

Authors: We agree that time reversal is not a targeted do-intervention on a single causal variable. Our method draws directly from the Violation of Expectation paradigm, using reversal to create scalable, real-world violations of expected physical dynamics rather than precise graph interventions. RSI quantifies detection of such violations as a necessary (but not sufficient) component of causal perception. We will revise the abstract and method sections to describe these as 'approximate natural counterfactuals' to prevent overstatement.

Revision: partial

Referee: CCI relies on an off-the-shelf VLM to partition videos into causal vs. non-causal subsets with no reported calibration against human judgments or formal causal criteria; without this, the reported gap between RSI and CCI may reflect VLM annotation artifacts rather than VDM causal understanding.

Authors: The concern is valid. The current manuscript applies an off-the-shelf VLM with prompts targeting agent-driven cause-effect relations but does not report human calibration. We will add a human validation study on a data subset, report agreement metrics, and include the exact stratification prompts and criteria in the revised version.

Revision: yes

Referee: The evaluation on 13 VDMs reports a dissociation and gap to humans but supplies no quantitative details on CCI computation, error bars, dataset sizes, or controls for VLM bias, preventing verification that the data support the central claim.

Authors: The full manuscript contains dataset sizes, the CCI formula, and per-model results. We agree that error bars, explicit dataset statistics, and VLM bias controls are insufficiently detailed. We will add a table with video counts, standard errors across VLM runs, and a discussion of bias mitigation in the revision.

Revision: yes
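The agreement metric for the proposed human validation study is not named in the rebuttal. Cohen's kappa is one standard option for two annotators, sketched here with a human and a VLM as the two raters. The function name and label strings are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labeled independently at their own base rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, which makes it more informative than raw accuracy when one label dominates the dataset.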

Circularity Check

0 steps flagged

No circularity: benchmark metrics are external evaluations, not self-derived


The paper introduces RSI (denoising loss on time-reversed videos) and CCI (VLM-based stratification) purely as evaluation protocols applied to existing VDMs. Neither metric is obtained by fitting parameters to the target result, nor does any central claim reduce to a self-citation chain or definitional equivalence. The reported dissociation between arrow-of-time perception and causal understanding follows directly from applying these independent indices to 13 models; no derivation step equates output to input by construction. This is a standard empirical benchmark paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that video reversal supplies valid counterfactuals and that VLM-based stratification isolates causal reasoning; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: Temporally reversing real-world videos at zero cost produces natural counterfactual samples suitable for testing causality. Invoked to justify the core evaluation protocol in both Level 1 and Level 2.
Domain assumption: A VLM can accurately stratify videos into causal and non-causal subsets to disentangle causal reasoning from temporal bias. Required for the CCI to isolate genuine causality understanding.


Reference graph

Works this paper leans on

134 extracted references · 61 canonical work pages · 35 internal anchors

[1]

Abdi et al

H. Abdi et al. The kendall rank correlation coefficient.Encyclopedia of measurement and statistics, 2:508–510, 2007

2007
[2]

Cosmos World Foundation Model Platform for Physical AI

N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P . Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

T. Ates, M. Ate¸ so˘ glu, Ç. Yi˘ git, I. Kesen, M. Kobas, E. Erdem, A. Erdem, T. Goksun, and D. Yuret. Craft: A benchmark for causal reasoning about forces and interactions. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2602–2627, 2022

2022
[4]

Z. Bai, H. Ci, and M. Z. Shou. Impossible videos.arXiv preprint arXiv:2503.14378, 2025

work page arXiv 2025
[5]

Baillargeon

R. Baillargeon. Infants’ physical world.Current directions in psychological science, 13(3):89–94, 2004

2004
[6]

Baillargeon, E

R. Baillargeon, E. S. Spelke, and S. Wasserman. Object permanence in five-month-old infants.Cognition, 20(3):191–208, 1985

1985
[7]

VideoPhy: Evaluating Physical Commonsense for Video Generation

H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

H. Bansal, C. Peng, Y. Bitton, R. Goldenberg, A. Grover, and K.-W. Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

work page arXiv 2025
[9]

Bar-Tal, H

O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, G. Liu, A. Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

2024
[10]

Baradel, N

F. Baradel, N. Neverova, J. Mille, G. Mori, and C. Wolf. Cophy: Counterfactual learning of physical dynamics.arXiv preprint arXiv:1909.12000, 2019

work page arXiv 1909
[11]

P . W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum. Simulation as an engine of physical scene understanding.Proceedings of the national academy of sciences, 110(45):18327–18332, 2013

2013
[12]

D. M. Bear, E. Wang, D. Mrowca, F. J. Binder, H.-Y. F. Tung, R. Pramod, C. Holdaway, S. Tao, K. Smith, F.-Y. Sun, et al. Physion: Evaluating physical prediction from vision in humans and machines.arXiv preprint arXiv:2106.08261, 2021

work page arXiv 2021
[13]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V . Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. 23 Figure A.5 Scaling laws and generational trends in causal cognition.Aggregate causal cognition rank correlates positively with both release date ( ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Blattmann, R

A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

2023
[15]

Bordes, Q

F. Bordes, Q. Garrido, J. T. Kao, A. Williams, M. Rabbat, and E. Dupoux. Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments.arXiv preprint arXiv:2506.09849, 2025

work page arXiv 2025
[16]

Brooks, B

T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

2024
[17]

Bruce, M

J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024
[18]

M. Cai, R. Tan, J. Zhang, B. Zou, K. Zhang, F. Yao, F. Zhu, J. Gu, Y. Zhong, Y. Shang, et al. Tempo- ralbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

work page arXiv 2024
[19]

Chandrasegaran, A

K. Chandrasegaran, A. Gupta, L. M. Hadzic, T. Kota, J. He, C. Eyzaguirre, Z. Durante, M. Li, J. Wu, and L. Fei-Fei. Hourvideo: 1-hour video-language understanding.Advances in Neural Information Processing Systems, 37:53168–53197, 2024

2024
[20]

Chao, W.-F

C.-H. Chao, W.-F. Sun, B.-W. Cheng, Y.-C. Lo, C.-C. Chang, Y.-L. Liu, Y.-L. Chang, C.-P . Chen, and C.-Y. Lee. Denoising likelihood score matching for conditional score-based data generation.arXiv preprint arXiv:2203.14206, 2022

work page arXiv 2022
[21]

Y. Chen, J. Liu, X. Lin, and R. Tang. Countervqa: Evaluating and improving counterfactual reasoning in vision-language models for video understanding.arXiv preprint arXiv:2511.19923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

H. Chi, H. Li, W. Yang, F. Liu, L. Lan, X. Ren, T. Liu, and B. Han. Unveiling causal reasoning in large language models: Reality or mirage?Advances in Neural Information Processing Systems, 37:96640–96670, 2024

2024
[23]

Clark and P

K. Clark and P . Jaini. Text-to-image diffusion models are zero shot classifiers.Advances in Neural Information Processing Systems, 36:58921–58937, 2023

2023
[24]

Cores, M

D. Cores, M. Dorkenwald, M. Mucientes, C. G. Snoek, and Y. M. Asano. Tvbench: Redesigning video-language evaluation. 2024

2024
[25]

Croitoru, V

F.-A. Croitoru, V . Hondru, R. T. Ionescu, and M. Shah. Diffusion models in vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 45(9):10850–10869, 2023

2023
[26]

Dasgupta, J

A. Dasgupta, J. Duan, M. H. Ang Jr, and C. Tan. Avoe: a synthetic 3d dataset on understanding violation of expectation for artificial cognition.arXiv preprint arXiv:2110.05836, 2021

work page arXiv 2021
[27]

Didelez and I

V . Didelez and I. Pigeot. Causality: models, reasoning, and inference, 2001

2001
[28]

Y. Du, M. Yang, P . Florence, F. Xia, A. Wahid, B. Ichter, P . Sermanet, T. Yu, P . Abbeel, J. B. Tenenbaum, 24 et al. Video language planning.arXiv preprint arXiv:2310.10625, 2023

work page arXiv 2023
[29]

Dummett.Principles of electoral reform

M. Dummett.Principles of electoral reform. Oxford University Press, 1997

1997
[30]

P . Emerson. The original borda count and partial voting.Social Choice and Welfare, 40(2):353–358, 2013

2013
[31]

Esser, J

P . Esser, J. Chiu, P . Atighehchian, J. Granskog, and A. Germanidis. Structure and content-guided video synthesis with diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 7346–7356, 2023

2023
[32]

A. Foss, C. Evans, S. Mitts, K. Sinha, A. Rizvi, and J. T. Kao. Causalvqa: A physically grounded causal reasoning benchmark for video models.arXiv preprint arXiv:2506.09943, 2025

work page arXiv 2025
[33]

C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang, et al. Video- mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

2025
[34]

Gandhi, G

K. Gandhi, G. Stojnic, B. M. Lake, and M. R. Dillon. Baby intuitions benchmark (bib): Discerning the goals, preferences, and actions of others.Advances in neural information processing systems, 34:9963–9976, 2021

2021
[35]

Garrido, N

Q. Garrido, N. Ballas, M. Assran, A. Bardes, L. Najman, M. Rabbat, E. Dupoux, and Y. LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos.arXiv preprint arXiv:2502.11831, 2025

work page arXiv 2025
[36]

S. Ge, A. Mahapatra, G. Parmar, J.-Y. Zhu, and J.-B. Huang. On the content bias in fréchet video distance. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7277–7288, 2024

2024
[37]

Girdhar, M

R. Girdhar, M. Singh, A. Brown, Q. Duval, S. Azadi, S. S. Rambhatla, A. Shah, X. Yin, D. Parikh, and I. Misra. Factorizing text-to-video generation by explicit image conditioning. InEuropean Conference on Computer Vision, pages 205–224. Springer, 2024

2024
[38]

Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Gupta, L

A. Gupta, L. Yu, K. Sohn, X. Gu, M. Hahn, F.-F. Li, I. Essa, L. Jiang, and J. Lezama. Photorealistic video generation with diffusion models. InEuropean Conference on Computer Vision, pages 393–411. Springer, 2024

2024
[40]

World Models

D. Ha and J. Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[41]

LTX-Video: Realtime Video Latent Diffusion

Y. HaCohen, N. Chiprut, B. Brazowski, D. Shalem, D. Moshe, E. Richardson, E. Levin, G. Shiran, N. Zabari, O. Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Hafner, T

D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

2019
[43]

Mastering Atari with Discrete World Models

D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[44]

Hanyu, K

N. Hanyu, K. Watanabe, and S. Kitazawa. Ready to detect a reversal of time’s arrow: a psychophysical study using short video clips in daily scenes.Royal Society open science, 10(4), 2023

2023
[45]

J. Ho, A. Jain, and P . Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[46]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022

2022
[48]

W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[49]

A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Huang, Y

Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024
[51]

Huang, F

Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 25

2025
[52]

GPT-4o System Card

A. Hurst, A. Lerer, A. P . Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Z. Jin, Y. Chen, F. Leeb, L. Gresele, O. Kamal, Z. Lyu, K. Blin, F. Gonzalez Adauto, M. Kleiman-Weiner, M. Sachan, et al. Cladder: Assessing causal reasoning in language models.Advances in Neural Information Processing Systems, 36:31038–31065, 2023

2023
[54]

Z. Jin, J. Liu, Z. Lyu, S. Poff, M. Sachan, R. Mihalcea, M. Diab, and B. Schölkopf. Can large language models infer causation from correlation?arXiv preprint arXiv:2306.05836, 2023

work page arXiv 2023
[55]

B. Kang, Y. Yue, R. Lu, Z. Lin, Y. Zhao, K. Wang, G. Huang, and J. Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[57]

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P . Natsev, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[58]

Kiciman, R

E. Kiciman, R. Ness, A. Sharma, and C. Tan. Causal reasoning and large language models: Opening a new frontier for causality.Transactions on Machine Learning Research, 2023

2023
[59]

Kingma, T

D. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models.Advances in neural information processing systems, 34:21696–21707, 2021

2021
[60]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, G. Schindler, R. Hornung, V . Birodkar, J. Yan, M.-C. Chiu, et al. Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. Hunyuan- video: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people.Behavioral and brain sciences, 40:e253, 2017

2017
[63]

D. Layzer. The arrow of time.Scientific American, 233(6):56–69, 1975

1975
[64]

LeCun et al

Y. LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62(1):1–62, 2022

2022
[65]

A. M. Leslie and S. Keeble. Do six-month-old infants perceive causality?Cognition, 25(3):265–288, 1987

1987
[66]

A. C. Li, M. Prabhudesai, S. Duggal, E. Brown, and D. Pathak. Your diffusion model is secretly a zero-shot classifier. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2206–2217, 2023

2023
[67]

C. Li, O. Michel, X. Pan, S. Liu, M. Roberts, and S. Xie. Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop.arXiv preprint arXiv:2503.09595, 2025

work page arXiv 2025
[68]

D. Li, Y. Fang, Y. Chen, S. Yang, S. Cao, J. Wong, M. Luo, X. Wang, H. Yin, J. E. Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.arXiv preprint arXiv:2502.20694, 2025

work page arXiv 2025
[69]

J. Li, L. Niu, and L. Zhang. From representation to reasoning: Towards both evidence and common- sense reasoning for video question-answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21273–21282, 2022

2022
[70]

K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P . Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

2024
[71]

S. Li, L. Li, Y. Liu, S. Ren, Y. Liu, R. Gao, X. Sun, and L. Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. InEuropean Conference on Computer Vision, pages 331–348. Springer, 2024

2024
[72]

Y. Li, W. Tian, Y. Jiao, J. Chen, and Y.-G. Jiang. Eyes can deceive: Benchmarking counterfactual reasoning abilities of multi-modal large language models.arXiv preprint arXiv:2404.12966, 3, 2024

work page arXiv 2024
[73]

Liang, H

Z. Liang, H. He, C. Yang, and B. Dai. Scaling laws for diffusion transformers.arXiv preprint arXiv:2410.08184, 2024

work page arXiv 2024
[74]

J. Lin, Y. Du, O. Watkins, D. Hafner, P . Abbeel, D. Klein, and A. Dragan. Learning to model the world with language.arXiv preprint arXiv:2308.01399, 2023

work page arXiv 2023
[75]

X. Liu, Z. Xu, M. Li, K. Wang, Y. J. Lee, and Y. Shang. Can world simulators reason? gen-vire: A generative visual reasoning benchmark.arXiv preprint arXiv:2511.13853, 2025. 26

work page arXiv 2025
[76]

Y. Liu, S. Li, Y. Liu, Y. Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

2024
[77]

Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[78]

Margoni, L

F. Margoni, L. Surian, and R. Baillargeon. The violation-of-expectation paradigm: A conceptual overview.Psychological Review, 131(3):716, 2024

2024
[79]

Matsuo, Y

Y. Matsuo, Y. LeCun, M. Sahani, D. Precup, D. Silver, M. Sugiyama, E. Uchibe, and J. Morimoto. Deep learning, reinforcement learning, and world models.Neural Networks, 152:267–275, 2022

2022
[80]

R. P . McDonald. Judea pearl. causality: Models, reasoning, and inference. cambridge: Cambridge university press. 384 pp., 2000, isbn 0521773628.Psychometrika, 67(2):321–322, 2002

2000

Showing first 80 references.
