Do generative video models understand physical principles?
Pith reviewed 2026-05-20 12:42 UTC · model grok-4.3
The pith
Current generative video models show severely limited understanding of physical principles, unrelated to how realistic their outputs appear.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating a range of current video generation models on the Physics-IQ benchmark establishes that their physical understanding is severely limited and shows no relation to the visual realism of the generated videos, even though some specific physical principles can already be solved successfully from observation alone.
What carries the argument
The Physics-IQ benchmark dataset, consisting of test cases that can only be solved by applying physical principles from fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics.
If this is right
- Visual realism in generated videos does not guarantee correct physical behavior.
- Certain physical principles can be acquired from observational data alone.
- Substantial challenges remain for models to achieve broad physical understanding.
- Progress toward reliable world models will require addressing these specific gaps.
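One way to make the first implication concrete is to rank-correlate a per-model physics score with a per-model realism score. The sketch below is a minimal, standard-library implementation of Spearman rank correlation; the function names and the six score pairs are illustrative assumptions, not values or code from the paper.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores for six hypothetical models (NOT results from the paper).
physics_score = [0.10, 0.25, 0.15, 0.30, 0.20, 0.05]
realism_score = [0.90, 0.40, 0.70, 0.60, 0.50, 0.80]
rho = spearman(physics_score, realism_score)
```

A value of `rho` near zero would be consistent with "unrelated"; with only a handful of models, any such estimate carries wide uncertainty.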
Where Pith is reading between the lines
- Training objectives focused on pixel prediction may need supplementation with explicit physical constraints to improve results.
- The benchmark could be extended to track whether future models generalize physical rules to entirely new scenarios.
- Applications such as robotics planning or scientific visualization may still require separate physics engines even with realistic video outputs.
Load-bearing premise
Solving the Physics-IQ test cases requires acquiring a genuine understanding of physical principles rather than succeeding through statistical patterns or memorization from training data.
What would settle it
A model that scores highly on Physics-IQ yet produces videos violating the tested physical principles in novel, out-of-distribution scenarios would indicate success without understanding.
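The settling condition above can be written as a simple decision rule. This is a sketch only: the `ModelReport` fields and both thresholds are hypothetical choices made for illustration, and neither the paper nor this review specifies them.

```python
from dataclasses import dataclass


@dataclass
class ModelReport:
    name: str
    benchmark_score: float      # fraction in [0, 1] achieved on the benchmark
    ood_violation_rate: float   # fraction of novel scenarios showing a physics violation


def succeeds_without_understanding(
    report: ModelReport,
    score_threshold: float = 0.8,       # assumed cutoff for "scores highly"
    violation_threshold: float = 0.2,   # assumed tolerance for OOD violations
) -> bool:
    """True when a model scores highly yet still violates physics out of distribution."""
    return (
        report.benchmark_score >= score_threshold
        and report.ood_violation_rate > violation_threshold
    )


# Hypothetical report, not a measured result.
example = ModelReport("hypothetical-model", benchmark_score=0.9, ood_violation_rate=0.5)
flagged = succeeds_without_understanding(example)
```

A flagged model would count as evidence against the load-bearing premise, since it would show that a high score is attainable without the tested principles generalizing.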
Original abstract
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Physics-IQ, a new benchmark dataset for testing physical understanding in generative video models. It evaluates models including Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet, concluding that physical understanding remains severely limited across these systems and is unrelated to visual realism, while some individual test cases can already be solved.
Significance. If the benchmark cases truly require deep physical principles rather than permitting statistical shortcuts, the results would provide clear evidence that advances in visual realism do not imply acquisition of world models or physical laws. The public release of the benchmark and code at https://github.com/google-deepmind/physics-IQ-benchmark is a positive contribution that supports reproducibility and follow-on work.
major comments (1)
- [Abstract] Abstract: The assertion that Physics-IQ 'can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics' is load-bearing for the central claim that model failures indicate absence of understanding. No explicit controls, ablations, or evidence are described to rule out solutions via statistical pattern matching, memorization of common video statistics, or visual heuristics from training data.
minor comments (1)
- [Methods] The manuscript would benefit from additional detail on task construction, exact scoring procedures, and any statistical controls for confounds in the benchmark design.
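One kind of control the comments above ask for is a trivial shortcut baseline: if a predictor that ignores dynamics entirely scores close to a real model, the metric is not isolating physical understanding. The sketch below scores a "freeze the last frame" baseline with mean squared error on a toy clip. MSE, the frame layout, and all names here are assumptions for illustration; the scoring procedure actually used by the benchmark is not described in this text.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame: rows of pixel intensities


def frame_mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two frames of identical shape."""
    diffs = [(pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def video_mse(pred: List[Frame], truth: List[Frame]) -> float:
    """Average per-frame MSE over a clip."""
    if len(pred) != len(truth):
        raise ValueError("clips must have the same number of frames")
    return sum(frame_mse(p, t) for p, t in zip(pred, truth)) / len(truth)


def freeze_last_frame(context: List[Frame], n_future: int) -> List[Frame]:
    """Shortcut baseline: repeat the last observed frame, ignoring dynamics."""
    return [[row[:] for row in context[-1]] for _ in range(n_future)]


# Toy clip: a bright pixel moves one column to the right per frame.
context = [[[1.0, 0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]]]
truth = [[[0.0, 0.0, 1.0, 0.0]], [[0.0, 0.0, 0.0, 1.0]]]

baseline_error = video_mse(freeze_last_frame(context, len(truth)), truth)
oracle_error = video_mse(truth, truth)
```

Reporting each model's error next to `baseline_error` would show how much of its score a motion-free heuristic already explains.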
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive remarks on the benchmark's release and reproducibility. We address the major comment regarding the abstract below.
- Referee: [Abstract] Abstract: The assertion that Physics-IQ 'can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics' is load-bearing for the central claim that model failures indicate absence of understanding. No explicit controls, ablations, or evidence are described to rule out solutions via statistical pattern matching, memorization of common video statistics, or visual heuristics from training data.
  Authors: We appreciate this observation, as the strength of the claim does depend on the benchmark probing understanding beyond surface statistics. The test cases were deliberately constructed around specific physical principles applied in combinations and contexts that are not directly recoverable from typical training video distributions or simple visual heuristics; model failures frequently produce outcomes that violate conservation laws or causal structure in ways inconsistent with pattern completion. Nevertheless, the manuscript does not present explicit ablations or controls that quantify the contribution of statistical shortcuts, which is a fair critique. We will revise the abstract to replace the phrasing 'can only be solved by acquiring a deep understanding' with 'is designed to require a deep understanding' and will add a dedicated paragraph in the methods section discussing benchmark construction choices intended to reduce the efficacy of memorization and heuristics. This is a partial revision because the empirical results and overall conclusions are unaffected.
  Revision: partial
Circularity Check
No circularity: benchmark evaluation is independent of self-referential inputs
The paper introduces the Physics-IQ benchmark as a new dataset designed to test physical principles and evaluates third-party models (Sora, Runway, etc.) on it. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The assertion that test cases 'can only be solved by acquiring a deep understanding' is a design premise for the benchmark rather than a result derived from the paper's own equations or prior author work. The central findings rest on empirical performance metrics for external models and therefore do not depend on the paper's own prior results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The benchmark tasks can only be solved by acquiring a deep understanding of various physical principles.
Lean theorems connected to this paper
- Foundation.DimensionForcing: dimension_forced (tag: unclear)
  Paper passage: We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics.
- Foundation.LawOfExistence: existence_economically_inevitable (tag: unclear)
  Paper passage: Our work demonstrates that visual realism does not imply physical understanding.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- PhysInOne: Visual Physics Learning and Reasoning in One Suite
  PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
- Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
  Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...
- OSCBench: Benchmarking Object State Change in Text-to-Video Generation
  OSCBench demonstrates that text-to-video models produce inaccurate and temporally inconsistent object state changes, with performance dropping sharply on novel and compositional action scenarios.
- Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
  BasPhyCo is the first physical commonsense reasoning dataset for Basque and dialects, showing LLMs have limited performance on verifiability tasks especially with dialects.
- DreamGen: Unlocking Generalization in Robot Learning through Video World Models
  DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
- NEWTON: Agentic Planning for Physically Grounded Video Generation
  NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
- PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics
  PH-Dreamer integrates a port-Hamiltonian framework into generative world models to enforce physical priors, yielding tighter imagined-real reward alignment and reduced latent space volume on visual control benchmarks.
- Learning Long-term Motion Embeddings for Efficient Kinematics Generation
  A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
- ProPhy: Progressive Physical Alignment for Dynamic World Simulation
  ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.
- Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
  A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.
- Video models are zero-shot learners and reasoners
  Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
- Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
  RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.
- FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
  FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.
- MAGI-1: Autoregressive Video Generation at Scale
  MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
  Phantom generates visually realistic and physically consistent videos by jointly modeling visual content and latent physical dynamics via an abstract physics-aware representation.
- MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
  MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
- DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
  DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
- World Simulation with Video Foundation Models for Physical AI
  Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
- Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
  A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...
- Evolution of Video Generative Foundations
  This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.