
arxiv: 2605.30346 · v1 · pith:N5A4EID6new · submitted 2026-05-28 · 💻 cs.CV

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Pith reviewed 2026-06-29 08:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords video diffusion models · causality · world models · benchmark · counterfactuals · arrow of time · violation of expectation

The pith

Video diffusion models notice when time runs backward but do not grasp cause and effect like humans do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether video diffusion models function as world models by checking if they understand causality or only statistical timing patterns. It introduces YoCausal, which reverses real videos at no cost to create natural counterfactual examples and measures both time-direction awareness and causal reasoning separately. On 13 current models, time-direction detection turns out to be unrelated to actual causal understanding, leaving a clear shortfall compared with human performance. This distinction matters because models positioned as simulators of the physical world need reliable cause-effect reasoning to predict what happens next in new situations.

Core claim

YoCausal is a two-level benchmark that first quantifies arrow-of-time perception through a Reverse Surprise Index based on denoising loss when videos are played backward, then applies a Causality Cognition Index that uses a vision-language model to split videos into causal and non-causal groups. Evaluation across 13 state-of-the-art video diffusion models shows that strong performance on the first index does not produce strong performance on the second, revealing that temporal pattern recognition alone does not deliver causal cognition and that current models remain far from human levels on real-world videos.

What carries the argument

YoCausal benchmark that creates natural counterfactuals by temporally reversing real videos, then computes Reverse Surprise Index for time-direction sensitivity and Causality Cognition Index to isolate genuine causal reasoning from temporal bias.
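
As a rough illustration of the reversal-plus-denoising-loss idea, the Python sketch below flips a clip along its time axis and compares a noise-prediction loss on the forward and reversed versions. Everything in it is an assumption made for illustration: `reverse_video`, `denoising_loss`, `reverse_surprise`, the normalized-difference formula, and the toy ramp denoiser are hypothetical stand-ins, not the paper's RSI definition or any real model's API.

```python
import numpy as np


def reverse_video(frames: np.ndarray) -> np.ndarray:
    """Flip a (T, H, W, C) clip along its time axis."""
    return frames[::-1].copy()


def denoising_loss(denoiser, frames, noise_level, seed):
    """MSE between the injected noise and the denoiser's noise estimate."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(frames.shape)
    noisy = frames + noise_level * noise
    return float(np.mean((denoiser(noisy, noise_level) - noise) ** 2))


def reverse_surprise(denoiser, frames, noise_level=0.5, seed=0):
    """Hypothetical score in [-1, 1]; positive means the reversed clip is harder."""
    loss_fwd = denoising_loss(denoiser, frames, noise_level, seed)
    loss_rev = denoising_loss(denoiser, reverse_video(frames), noise_level, seed)
    # Small constant guards against division by zero when both losses vanish.
    return (loss_rev - loss_fwd) / (loss_rev + loss_fwd + 1e-12)


# Toy stand-in for a video model: it "expects" brightness to rise over time.
T = 8
ramp = np.linspace(0.0, 1.0, T)[:, None, None, None] * np.ones((T, 4, 4, 1))


def toy_denoiser(noisy, noise_level):
    # Assumes the clean clip is the rising ramp and solves for the noise.
    return (noisy - ramp) / noise_level


rising_clip = ramp.copy()               # matches the toy model's temporal prior
static_clip = np.full_like(ramp, 0.5)   # identical when reversed

score_rising = reverse_surprise(toy_denoiser, rising_clip)
score_static = reverse_surprise(toy_denoiser, static_clip)
```

In this toy setup the rising clip scores close to 1, while the static clip, which is unchanged by reversal, scores 0. A real evaluation would replace `toy_denoiser` with the video diffusion model under test and average over noise levels and seeds.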

If this is right

  • Models can detect time reversal without acquiring causal reasoning.
  • Synthetic-data benchmarks may overlook real-world causal failures.
  • Current video diffusion models fall short of human causal cognition.
  • The two-level protocol can be extended to new models at low cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improving the Causality Cognition Index could lead models to generate more physically consistent future frames.
  • The same reversal technique might expose causal gaps in other generative domains such as audio or 3D scenes.
  • Explicit causal objectives beyond standard diffusion training may be needed to close the human gap.

Load-bearing premise

Reversing real-world videos produces valid natural counterfactual samples, and a vision-language model can accurately separate causal from non-causal videos.

What would settle it

A model that scores equally on causal and non-causal subsets in the Causality Cognition Index or reaches human-level scores on both indices would contradict the reported gap between time perception and causal understanding.
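
The subset comparison in this criterion can be sketched numerically. The snippet assumes each video already has a per-video surprise score and a VLM-assigned causal or non-causal label; both inputs are hypothetical, and `subset_gap` is an illustrative name rather than the paper's CCI formula. Under these assumptions, a model with no extra sensitivity to causal content would show a gap near zero.

```python
from statistics import mean


def subset_gap(scores, is_causal):
    """Return (mean on causal subset, mean on non-causal subset, their difference).

    `scores` are hypothetical per-video surprise scores; `is_causal` holds the
    matching boolean labels, assumed to come from a VLM-based stratification.
    """
    if len(scores) != len(is_causal):
        raise ValueError("scores and labels must have the same length")
    causal = [s for s, c in zip(scores, is_causal) if c]
    non_causal = [s for s, c in zip(scores, is_causal) if not c]
    if not causal or not non_causal:
        raise ValueError("both subsets must be non-empty")
    causal_mean, non_causal_mean = mean(causal), mean(non_causal)
    return causal_mean, non_causal_mean, causal_mean - non_causal_mean


# Made-up scores for four videos, two labeled causal and two non-causal.
causal_mean, non_causal_mean, gap = subset_gap(
    [0.4, 0.6, 0.1, 0.3], [True, True, False, False]
)
```

With these made-up numbers the causal subset averages about 0.5 and the non-causal subset about 0.2, so the gap is about 0.3.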

read the original abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces YoCausal, a two-level benchmark for evaluating causal understanding in video diffusion models (VDMs) inspired by the Violation of Expectation paradigm. It treats temporally reversed real-world videos as natural counterfactual samples, defines the Reverse Surprise Index (RSI) to quantify arrow-of-time perception via denoising loss, and the Causality Cognition Index (CCI) via VLM-based stratification into causal vs. non-causal subsets. Evaluation across 13 state-of-the-art VDMs concludes that arrow-of-time perception does not imply causal understanding and that a significant gap remains relative to human causal cognition.

Significance. If the premises hold, this provides a scalable, real-world, zero-cost protocol for disentangling temporal bias from causal reasoning in generative video models, extending cognitive science methods to assess progress toward world models. It offers falsifiable indices and highlights a dissociation that could guide future VDM development.

major comments (3)
  1. [Abstract] Treating temporally reversed videos as 'natural counterfactual samples' is load-bearing for the central dissociation claim, yet reversal simultaneously violates multiple irreversible processes (entropy, gravity, friction) without corresponding to a targeted do-intervention or single-cause counterfactual in a causal graph; this risks conflating general physics-violation detection with causal reasoning.
  2. [Abstract] CCI relies on an off-the-shelf VLM to partition videos into causal vs. non-causal subsets with no reported calibration against human judgments or formal causal criteria; without this, the reported gap between RSI and CCI may reflect VLM annotation artifacts rather than VDM causal understanding.
  3. [Abstract] The evaluation on 13 VDMs reports a dissociation and gap to humans but supplies no quantitative details on CCI computation, error bars, dataset sizes, or controls for VLM bias, preventing verification that the data support the central claim.
minor comments (1)
  1. The abstract would benefit from explicit references to causal inference literature (e.g., Pearl's do-calculus) and prior VoE implementations to situate the protocol.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below with clarifications on our methodology and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee: Treating temporally reversed videos as 'natural counterfactual samples' is load-bearing for the central dissociation claim, yet reversal simultaneously violates multiple irreversible processes (entropy, gravity, friction) without corresponding to a targeted do-intervention or single-cause counterfactual in a causal graph; this risks conflating general physics-violation detection with causal reasoning.

    Authors: We agree that time reversal is not a targeted do-intervention on a single causal variable. Our method draws directly from the Violation of Expectation paradigm, using reversal to create scalable, real-world violations of expected physical dynamics rather than precise graph interventions. RSI quantifies detection of such violations as a necessary (but not sufficient) component of causal perception. We will revise the abstract and method sections to describe these as 'approximate natural counterfactuals' to prevent overstatement. revision: partial

  2. Referee: CCI relies on an off-the-shelf VLM to partition videos into causal vs. non-causal subsets with no reported calibration against human judgments or formal causal criteria; without this, the reported gap between RSI and CCI may reflect VLM annotation artifacts rather than VDM causal understanding.

    Authors: The concern is valid. The current manuscript applies an off-the-shelf VLM with prompts targeting agent-driven cause-effect relations but does not report human calibration. We will add a human validation study on a data subset, report agreement metrics, and include the exact stratification prompts and criteria in the revised version. revision: yes

  3. Referee: The evaluation on 13 VDMs reports a dissociation and gap to humans but supplies no quantitative details on CCI computation, error bars, dataset sizes, or controls for VLM bias, preventing verification that the data support the central claim.

    Authors: The full manuscript contains dataset sizes, the CCI formula, and per-model results. We agree that error bars, explicit dataset statistics, and VLM bias controls are insufficiently detailed. We will add a table with video counts, standard errors across VLM runs, and a discussion of bias mitigation in the revision. revision: yes
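
The rebuttal promises agreement metrics against human labels and standard errors across VLM runs without naming specific statistics. As a hedged sketch of what such a check could look like, the Python below computes Cohen's kappa between two label lists and the standard error of the mean over repeated runs. The choice of kappa, the function names, and the example labels are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently at their own rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def standard_error(values):
    """Standard error of the mean, using the sample (n - 1) variance."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    m = sum(values) / n
    variance = sum((v - m) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


# Hypothetical labels: 1 = causal, 0 = non-causal.
vlm_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
kappa = cohens_kappa(vlm_labels, human_labels)

# Hypothetical index values from three repeated VLM stratification runs.
se = standard_error([1.0, 2.0, 3.0])
```

Here the two annotators agree on three of four items, chance agreement is 0.5, and kappa comes out at 0.5; the standard error of the three run values is about 0.577.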

Circularity Check

0 steps flagged

No circularity: benchmark metrics are external evaluations, not self-derived

Full rationale

The paper introduces RSI (denoising loss on time-reversed videos) and CCI (VLM-based stratification) purely as evaluation protocols applied to existing VDMs. Neither metric is obtained by fitting parameters to the target result, nor does any central claim reduce to a self-citation chain or definitional equivalence. The reported dissociation between arrow-of-time perception and causal understanding follows directly from applying these independent indices to 13 models; no derivation step equates output to input by construction. This is a standard empirical benchmark paper with no load-bearing self-referential steps.
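
One way to quantify a dissociation between two indices is a rank correlation over per-model scores. The sketch below implements Kendall's tau-a in plain Python; its use here is an assumption for illustration and is not confirmed as the paper's analysis, and the score lists are made up. A value near zero would indicate that ranking models by one index says little about their ranking on the other.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Ties (sign == 0) count toward neither total in tau-a.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-model scores on two indices for four models.
index_one = [1, 2, 3, 4]
index_two = [2, 1, 4, 3]
tau = kendall_tau(index_one, index_two)
```

For these made-up lists, four pairs are concordant and two are discordant, giving a tau of 2/6, roughly 0.33.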

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that video reversal supplies valid counterfactuals and that VLM-based stratification isolates causal reasoning; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Temporally reversing real-world videos at zero cost produces natural counterfactual samples suitable for testing causality.
    Invoked to justify the core evaluation protocol in both Level 1 and Level 2.
  • domain assumption: A VLM can accurately stratify videos into causal and non-causal subsets to disentangle causal reasoning from temporal bias.
    Required for the CCI to isolate genuine causality understanding.

pith-pipeline@v0.9.1-grok · 5724 in / 1334 out tokens · 37547 ms · 2026-06-29T08:24:31.341787+00:00 · methodology

