YoCausal: How Far is Video Generation from World Model? A Causality Perspective
Pith reviewed 2026-06-29 08:24 UTC · model grok-4.3
The pith
Video diffusion models notice when time runs backward but do not grasp cause and effect like humans do.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
YoCausal is a two-level benchmark that first quantifies arrow-of-time perception through a Reverse Surprise Index based on denoising loss when videos are played backward, then applies a Causality Cognition Index that uses a vision-language model to split videos into causal and non-causal groups. Evaluation across 13 state-of-the-art video diffusion models shows that strong performance on the first index does not produce strong performance on the second, revealing that temporal pattern recognition alone does not deliver causal cognition and that current models remain far from human levels on real-world videos.
What carries the argument
The YoCausal benchmark creates natural counterfactuals by temporally reversing real videos, then computes the Reverse Surprise Index for time-direction sensitivity and the Causality Cognition Index to isolate genuine causal reasoning from temporal bias.
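The reversal step and the idea of comparing denoising loss in both playback directions can be illustrated with a small sketch. This is not the paper's Reverse Surprise Index formula, which this page does not reproduce. The names `reverse_video`, `surprise_gap`, and `toy_loss` are hypothetical, the plain loss difference is an assumed stand-in for the real index, and `toy_loss` is a hand-written placeholder for a video diffusion model's denoising loss.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # toy frame: a flat list of pixel intensities


def reverse_video(frames: List[Frame]) -> List[Frame]:
    """Flip the frame order; this is the zero-cost counterfactual sample."""
    return frames[::-1]


def surprise_gap(frames: List[Frame],
                 loss_fn: Callable[[List[Frame]], float]) -> float:
    """Assumed index: loss on the reversed clip minus loss on the original.

    A positive value means the model finds backward playback less expected.
    """
    return loss_fn(reverse_video(frames)) - loss_fn(frames)


def toy_loss(frames: List[Frame]) -> float:
    """Placeholder for a denoising loss (assumption, not a real model).

    It penalises any frame-to-frame drop in mean brightness, so it
    'expects' clips that get brighter over time.
    """
    means = [sum(f) / len(f) for f in frames]
    return sum(max(0.0, a - b) ** 2 for a, b in zip(means, means[1:]))


# A three-frame clip that brightens steadily.
clip = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
gap = surprise_gap(clip, toy_loss)
print(gap)  # 0.5 for this toy clip
```

In a real evaluation, `loss_fn` would wrap a model's denoising objective averaged over noise levels and seeds; the sketch only fixes the comparison pattern.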
If this is right
- Models can detect time reversal without acquiring causal reasoning.
- Synthetic-data benchmarks may overlook real-world causal failures.
- Current video diffusion models fall short of human causal cognition.
- The two-level protocol can be extended to new models at low cost.
Where Pith is reading between the lines
- Improving the Causality Cognition Index could lead models to generate more physically consistent future frames.
- The same reversal technique might expose causal gaps in other generative domains such as audio or 3D scenes.
- Explicit causal objectives beyond standard diffusion training may be needed to close the human gap.
Load-bearing premise
Reversing real-world videos produces valid natural counterfactual samples, and a vision-language model can accurately separate causal from non-causal videos.
What would settle it
A model that scores equally on causal and non-causal subsets in the Causality Cognition Index or reaches human-level scores on both indices would contradict the reported gap between time perception and causal understanding.
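The subset comparison behind this test can be sketched as follows. The code assumes each video already has a per-video score (for example, a reversal surprise value) and a VLM-assigned label; `stratify` and `subset_contrast` are hypothetical names, and the difference of subset means is an illustrative choice, not the paper's Causality Cognition Index definition.

```python
from statistics import mean
from typing import Dict, List


def stratify(scores: List[float], labels: List[str]) -> Dict[str, List[float]]:
    """Group per-video scores by their (assumed) VLM label."""
    groups: Dict[str, List[float]] = {"causal": [], "non_causal": []}
    for score, label in zip(scores, labels):
        groups[label].append(score)
    return groups


def subset_contrast(scores: List[float], labels: List[str]) -> float:
    """Mean score on the causal subset minus mean on the non-causal subset.

    A value near zero would mean the model treats both subsets alike.
    """
    groups = stratify(scores, labels)
    return mean(groups["causal"]) - mean(groups["non_causal"])


# Made-up scores for four videos, two per subset.
scores = [1.0, 3.0, 0.5, 1.5]
labels = ["causal", "causal", "non_causal", "non_causal"]
contrast = subset_contrast(scores, labels)
print(contrast)  # 1.0 for these made-up numbers
```

The numbers are invented for illustration only and say nothing about any evaluated model.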
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces YoCausal, a two-level benchmark for evaluating causal understanding in video diffusion models (VDMs) inspired by the Violation of Expectation paradigm. It treats temporally reversed real-world videos as natural counterfactual samples, defines the Reverse Surprise Index (RSI) to quantify arrow-of-time perception via denoising loss, and the Causality Cognition Index (CCI) via VLM-based stratification into causal vs. non-causal subsets. Evaluation across 13 state-of-the-art VDMs concludes that arrow-of-time perception does not imply causal understanding and that a significant gap remains relative to human causal cognition.
Significance. If the premises hold, this provides a scalable, real-world, zero-cost protocol for disentangling temporal bias from causal reasoning in generative video models, extending cognitive science methods to assess progress toward world models. It offers falsifiable indices and highlights a dissociation that could guide future VDM development.
major comments (3)
- [Abstract] Treating temporally reversed videos as 'natural counterfactual samples' is load-bearing for the central dissociation claim, yet reversal simultaneously violates multiple irreversible processes (entropy, gravity, friction) without corresponding to a targeted do-intervention or single-cause counterfactual in a causal graph; this risks conflating general physics-violation detection with causal reasoning.
- [Abstract] CCI relies on an off-the-shelf VLM to partition videos into causal vs. non-causal subsets with no reported calibration against human judgments or formal causal criteria; without this, the reported gap between RSI and CCI may reflect VLM annotation artifacts rather than VDM causal understanding.
- [Abstract] The evaluation on 13 VDMs reports a dissociation and gap to humans but supplies no quantitative details on CCI computation, error bars, dataset sizes, or controls for VLM bias, preventing verification that the data support the central claim.
minor comments (1)
- The abstract would benefit from explicit references to causal inference literature (e.g., Pearl's do-calculus) and prior VoE implementations to situate the protocol.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below with clarifications on our methodology and indicate planned revisions where appropriate.
- Referee: Treating temporally reversed videos as 'natural counterfactual samples' is load-bearing for the central dissociation claim, yet reversal simultaneously violates multiple irreversible processes (entropy, gravity, friction) without corresponding to a targeted do-intervention or single-cause counterfactual in a causal graph; this risks conflating general physics-violation detection with causal reasoning.
  Authors: We agree that time reversal is not a targeted do-intervention on a single causal variable. Our method draws directly from the Violation of Expectation paradigm, using reversal to create scalable, real-world violations of expected physical dynamics rather than precise graph interventions. RSI quantifies detection of such violations as a necessary (but not sufficient) component of causal perception. We will revise the abstract and method sections to describe these as 'approximate natural counterfactuals' to prevent overstatement.
  revision: partial
- Referee: CCI relies on an off-the-shelf VLM to partition videos into causal vs. non-causal subsets with no reported calibration against human judgments or formal causal criteria; without this, the reported gap between RSI and CCI may reflect VLM annotation artifacts rather than VDM causal understanding.
  Authors: The concern is valid. The current manuscript applies an off-the-shelf VLM with prompts targeting agent-driven cause-effect relations but does not report human calibration. We will add a human validation study on a data subset, report agreement metrics, and include the exact stratification prompts and criteria in the revised version.
  revision: yes
- Referee: The evaluation on 13 VDMs reports a dissociation and gap to humans but supplies no quantitative details on CCI computation, error bars, dataset sizes, or controls for VLM bias, preventing verification that the data support the central claim.
  Authors: The full manuscript contains dataset sizes, the CCI formula, and per-model results. We agree that error bars, explicit dataset statistics, and VLM bias controls are insufficiently detailed. We will add a table with video counts, standard errors across VLM runs, and a discussion of bias mitigation in the revision.
  revision: yes
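The promised agreement metrics and standard errors could take many forms; one common pairing is sketched below. Cohen's kappa is an assumed choice of agreement metric (the rebuttal does not name one), and `cohens_kappa` and `standard_error` are hypothetical helper names. All labels and scores are made up.

```python
from math import sqrt
from statistics import stdev
from typing import List


def cohens_kappa(rater_a: List[str], rater_b: List[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def standard_error(values: List[float]) -> float:
    """Sample standard deviation divided by sqrt(n), e.g. across VLM runs."""
    return stdev(values) / sqrt(len(values))


# Made-up labels from a VLM and a human annotator on four videos.
vlm = ["causal", "causal", "non_causal", "non_causal"]
human = ["causal", "non_causal", "non_causal", "non_causal"]
kappa = cohens_kappa(vlm, human)
print(kappa)  # 0.5 for these made-up labels

# Made-up index values from three repeated VLM stratification runs.
se = standard_error([1.0, 2.0, 3.0])
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual alternative.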
Circularity Check
No circularity: benchmark metrics are external evaluations, not self-derived
The paper introduces RSI (denoising loss on time-reversed videos) and CCI (VLM-based stratification) purely as evaluation protocols applied to existing VDMs. Neither metric is obtained by fitting parameters to the target result, nor does any central claim reduce to a self-citation chain or definitional equivalence. The reported dissociation between arrow-of-time perception and causal understanding follows directly from applying these independent indices to 13 models; no derivation step equates output to input by construction. This is a standard empirical benchmark paper with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Temporally reversing real-world videos at zero cost produces natural counterfactual samples suitable for testing causality.
- domain assumption: A VLM can accurately stratify videos into causal and non-causal subsets to disentangle causal reasoning from temporal bias.