Do generative video models understand physical principles?

Kevin Swersky; Laura Culp; Priyank Jaini; Robert Geirhos; Saman Motamed

arxiv: 2501.09038 · v3 · pith:XKNKDKXEnew · submitted 2025-01-14 · 💻 cs.CV · cs.AI · cs.GR · cs.LG

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos

Pith reviewed 2026-05-20 12:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR · cs.LG

keywords: generative video models · physical principles · benchmark dataset · Physics-IQ · visual realism · fluid dynamics · optics · world models

The pith

Current generative video models show severely limited understanding of physical principles, unrelated to how realistic their outputs appear.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Physics-IQ, a benchmark of video generation tasks that require grasping principles such as fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics. Testing models including Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet reveals poor performance overall, showing that visual realism does not equate to physical knowledge. Some individual test cases are handled correctly, which suggests that certain principles can be acquired from data alone. The work concludes that significant gaps remain in building models that truly capture physical laws from observation.

Core claim

Evaluating a range of current video generation models on the Physics-IQ benchmark establishes that their physical understanding is severely limited and shows no relation to the visual realism of the generated videos, even though some specific physical principles can already be solved successfully from observation alone.

What carries the argument

The Physics-IQ benchmark dataset, consisting of test cases that can only be solved by applying physical principles from fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics.

If this is right

Visual realism in generated videos does not guarantee correct physical behavior.
Certain physical principles can be acquired from observational data alone.
Substantial challenges remain for models to achieve broad physical understanding.
Progress toward reliable world models will require addressing these specific gaps.
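The first consequence above is, at bottom, a statement about two per-model score lists being uncorrelated. A minimal sketch of how such a check could be run is given here, assuming one physics score and one realism score per model. The function name `pearson_r` and the numbers are illustrative placeholders, not the paper's metrics or results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only; NOT results from the paper.
physics_scores = [0.10, 0.30, 0.20, 0.25]
realism_scores = [0.90, 0.60, 0.70, 0.95]
r = pearson_r(physics_scores, realism_scores)
print(f"correlation between physics and realism scores: {r:.2f}")
```

A value of `r` near zero would be consistent with the "unrelated" reading. With only a handful of models, any such estimate is noisy and should be interpreted cautiously.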

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives focused on pixel prediction may need supplementation with explicit physical constraints to improve results.
The benchmark could be extended to track whether future models generalize physical rules to entirely new scenarios.
Applications such as robotics planning or scientific visualization may still require separate physics engines even with realistic video outputs.

Load-bearing premise

Solving the Physics-IQ test cases requires acquiring a genuine understanding of physical principles rather than succeeding through statistical patterns or memorization from training data.

What would settle it

A model that scores highly on Physics-IQ yet produces videos violating the tested physical principles in novel, out-of-distribution scenarios would indicate success without understanding.
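The falsification condition above can be written as a simple predicate. This is a sketch under stated assumptions: both scores are taken to be normalized to [0, 1], and the function name and the `high` and `low` thresholds are hypothetical choices made for illustration, not values from the paper.

```python
def success_without_understanding(benchmark_score, ood_score, high=0.8, low=0.5):
    """Return True when a model scores well on the benchmark but poorly on
    held-out, out-of-distribution scenarios that probe the same principles.

    The thresholds `high` and `low` are hypothetical, chosen for illustration.
    """
    for name, value in (("benchmark_score", benchmark_score),
                        ("ood_score", ood_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return benchmark_score >= high and ood_score < low


# Example: strong on the benchmark, weak on novel scenarios.
print(success_without_understanding(0.9, 0.2))
```

The predicate only flags a candidate case. Deciding whether the out-of-distribution scenarios really test the same principles remains a matter of benchmark design.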

Original abstract

AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Physics-IQ shows current video models have limited physical understanding that does not track with visual realism, but the benchmark's design leaves room for pattern-matching explanations.

Current video models have limited physical understanding that does not track with how realistic their videos look. The paper introduces Physics-IQ to test this directly across domains such as fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics. The authors evaluate several leading systems, including Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet, and report that overall performance on the physical tasks stays low while visual quality varies independently. A few individual cases are solved, which suggests that some principles can be picked up from data alone.

Referee Report

1 major / 1 minor

Summary. The paper introduces Physics-IQ, a new benchmark dataset for testing physical understanding in generative video models. It evaluates models including Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet, concluding that physical understanding remains severely limited across these systems and is unrelated to visual realism, while some individual test cases can already be solved.

Significance. If the benchmark cases truly require deep physical principles rather than permitting statistical shortcuts, the results would provide clear evidence that advances in visual realism do not imply acquisition of world models or physical laws. The public release of the benchmark and code at https://github.com/google-deepmind/physics-IQ-benchmark is a positive contribution that supports reproducibility and follow-on work.

major comments (1)

[Abstract] The assertion that Physics-IQ 'can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics' is load-bearing for the central claim that model failures indicate absence of understanding. No explicit controls, ablations, or evidence are described to rule out solutions via statistical pattern matching, memorization of common video statistics, or visual heuristics from training data.

minor comments (1)

[Methods] The manuscript would benefit from additional detail on task construction, exact scoring procedures, and any statistical controls for confounds in the benchmark design.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive remarks on the benchmark's release and reproducibility. We address the major comment regarding the abstract below.

Referee: [Abstract] The assertion that Physics-IQ 'can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics' is load-bearing for the central claim that model failures indicate absence of understanding. No explicit controls, ablations, or evidence are described to rule out solutions via statistical pattern matching, memorization of common video statistics, or visual heuristics from training data.

Authors: We appreciate this observation, as the strength of the claim does depend on the benchmark probing understanding beyond surface statistics. The test cases were deliberately constructed around specific physical principles applied in combinations and contexts that are not directly recoverable from typical training video distributions or simple visual heuristics; model failures frequently produce outcomes that violate conservation laws or causal structure in ways inconsistent with pattern completion. Nevertheless, the manuscript does not present explicit ablations or controls that quantify the contribution of statistical shortcuts, which is a fair critique. We will revise the abstract to replace the phrasing 'can only be solved by acquiring a deep understanding' with 'is designed to require a deep understanding' and will add a dedicated paragraph in the methods section discussing benchmark construction choices intended to reduce the efficacy of memorization and heuristics. This is a partial revision because the empirical results and overall conclusions are unaffected.

Revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark evaluation is independent of self-referential inputs

The paper introduces the Physics-IQ benchmark as a new dataset designed to test physical principles and evaluates third-party models (Sora, Runway, etc.) on it. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The assertion that test cases 'can only be solved by acquiring a deep understanding' is a design premise for the benchmark rather than a result derived from the paper's own equations or prior author work. The central findings rest on empirical performance measurements of external models and therefore do not depend on self-referential inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new benchmark without free parameters or invented physical entities; it rests on the domain assumption that benchmark success requires genuine physical understanding.

axioms (1)

domain assumption: The benchmark tasks can only be solved by acquiring a deep understanding of various physical principles.
This premise is stated directly in the abstract as the justification for using the dataset to measure physical understanding.

pith-pipeline@v0.9.0 · 5746 in / 1102 out tokens · 58681 ms · 2026-05-20T12:42:12.954307+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing · dimension_forced · unclear

Paper passage: We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics.

Foundation.LawOfExistence · existence_economically_inevitable · unclear

Paper passage: Our work demonstrates that visual realism does not imply physical understanding.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhysInOne: Visual Physics Learning and Reasoning in One Suite
cs.CV 2026-04 unverdicted novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
cs.CV 2026-05 unverdicted novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...

OSCBench: Benchmarking Object State Change in Text-to-Video Generation
cs.CV 2026-03 unverdicted novelty 7.0

OSCBench demonstrates that text-to-video models produce inaccurate and temporally inconsistent object state changes, with performance dropping sharply on novel and compositional action scenarios.

Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
cs.CL 2026-02 conditional novelty 7.0

BasPhyCo is the first physical commonsense reasoning dataset for Basque and dialects, showing LLMs have limited performance on verifiability tasks especially with dialects.

DreamGen: Unlocking Generalization in Robot Learning through Video World Models
cs.RO 2025-05 unverdicted novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...

NEWTON: Agentic Planning for Physically Grounded Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics
cs.LG 2026-05 unverdicted novelty 6.0

PH-Dreamer integrates a port-Hamiltonian framework into generative world models to enforce physical priors, yielding tighter imagined-real reward alignment and reduced latent space volume on visual control benchmarks.

Learning Long-term Motion Embeddings for Efficient Kinematics Generation
cs.CV 2026-04 unverdicted novelty 6.0

A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.

ProPhy: Progressive Physical Alignment for Dynamic World Simulation
cs.CV 2025-12 unverdicted novelty 6.0

ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
cs.CV 2025-09 unverdicted novelty 6.0

A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.

Video models are zero-shot learners and reasoners
cs.LG 2025-09 unverdicted novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
cs.RO 2025-07 unverdicted novelty 6.0

RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
cs.CV 2025-05 conditional novelty 6.0

FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.

MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
cs.CV 2026-04 unverdicted novelty 5.0

Phantom generates visually realistic and physically consistent videos by jointly modeling visual content and latent physical dynamics via an abstract physics-aware representation.

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
cs.CV 2025-11 unverdicted novelty 5.0

MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.

DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
cs.CV 2026-05 unverdicted novelty 4.0

DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

World Simulation with Video Foundation Models for Physical AI
cs.CV 2025-10 unverdicted novelty 4.0

Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
cs.AI 2025-10 unverdicted novelty 4.0

A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-groun...

Evolution of Video Generative Foundations
cs.CV 2026-04 unverdicted novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 20 Pith papers · 13 internal anchors

[1]

Sora: OpenAI’s Multimodal Agent

OpenAI. Sora: OpenAI’s Multimodal Agent. https://openai.com/index/sora/, 2024. Accessed: 2024-11-24

work page 2024
[2]

Veo2: Our state-of-the-art video generation model

DeepMind. Veo2: Our state-of-the-art video generation model. https://deepmind.google/technologies/veo/veo-2/, 2024. Accessed: 2025-01-09

work page 2024
[3]

Meta Movie Gen: AI-powered movie generation

Meta AI. Meta Movie Gen: AI-powered movie generation. https://ai.meta.com/research/movie-gen/, 2024. Accessed: 2024-11-24

work page 2024
[4]

Possible principles underlying the transformation of sensory messages

Horace B Barlow et al. Possible principles underlying the transformation of sensory messages. Sensory communication, 1(01):217–233, 1961

work page 1961
[5]

Handbuch der physiologischen Optik: mit 213 in den Text eingedruckten Holzschnitten und 11 Tafeln, volume 9

Hermann von Helmholtz. Handbuch der physiologischen Optik: mit 213 in den Text eingedruckten Holzschnitten und 11 Tafeln, volume 9. Voss, 1867

work page
[6]

A theory of cortical responses

Karl Friston. A theory of cortical responses. Philosophical transactions of the Royal Society B: Biological sciences, 360(1456):815–836, 2005

work page 2005
[7]

Shortcut learning in deep neural networks

Robert Geirhos, J¨orn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020

work page 2020
[8]

How Far is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang, Y ang Yue, Rui Lu, Zhijie Lin, Y ang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Runway Team. Runway. https://runwayml.com, 2024. Platform for AI-powered video editing and generative media creation

work page 2024
[10]

Pika labs

Pika Labs Team. Pika labs. https://pikalabs.com, 2024. Generative AI platform for creating video and visual content

work page 2024
[11]

Lumiere: A space-time diffusion model for video generation, 2024

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation, 2024

work page 2024
[12]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Y am Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Y an, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Y air Alon, Y ong Cheng, Joshua V. Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Y ang, Hartw...

work page 2024
[14]

Generalisation in humans and deep neural networks

Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Sch ¨utt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in neural information processing systems, 31, 2018

work page 2018
[15]

Benchmarking neural network robustness to common corruptions and perturbations

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018

work page 2018
[16]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adri`a Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Bear, Elias Wang, Damian Mrowca, Felix J

Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Y amins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines, 2021

work page 2021
[18]

Physion++: Evaluating physical scene understanding that requires online inference of different physical properties

Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Josh Tenenbaum, Dan Y amins, Judith Fan, and Kevin Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[19]

Craft: A benchmark for causal reasoning about forces and interactions

Tayfun Ates, M Samil Atesoglu, Cagatay Yigit, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, and Deniz Yuret. Craft: A benchmark for causal reasoning about forces and interactions. arXiv preprint arXiv:2012.04293, 2020

work page arXiv 2012
[20]

IntPhys: A framework and benchmark for visual intuitive physics reasoning

Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, V´eronique Izard, and Emmanuel Dupoux. IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616, 2018

work page arXiv 2018
[21]

Curvilinear motion in the absence of external forces: Naive beliefs about the motion of objects

Michael McCloskey, Alfonso Caramazza, and Bert Green. Curvilinear motion in the absence of external forces: Naive beliefs about the motion of objects. Science, 210(4474):1139–1141, 1980

work page 1980
[22]

Intuitive physics

Michael McCloskey. Intuitive physics. Scientific american, 248(4):122–131, 1983

work page 1983
[23]

Perception of partly occluded objects in infancy

Philip J Kellman and Elizabeth S Spelke. Perception of partly occluded objects in infancy. Cognitive psychology, 15(4):483–524, 1983

work page 1983
[24]

Origins of knowledge

Elizabeth S Spelke, Karen Breinlinger, Janet Macomber, and Kristen Jacobson. Origins of knowledge. Psychological review, 99(4):605, 1992

work page 1992
[25]

Spatiotemporal continuity, smoothness of motion and object identity in infancy

Elizabeth S Spelke, Roberta Kestenbaum, Daniel J Simons, and Debra Wein. Spatiotemporal continuity, smoothness of motion and object identity in infancy. British journal of developmental psychology, 13(2):113–142, 1995

work page 1995
[26]

A theory of causal learning in children: causal maps and bayes nets

Alison Gopnik, Clark Glymour, David M Sobel, Laura E Schulz, Tamar Kushnir, and David Danks. A theory of causal learning in children: causal maps and bayes nets. Psychological review, 111(1):3, 2004

work page 2004
[27]

The perception of causality in infancy

Rebecca Saxe and Susan Carey. The perception of causality in infancy. Acta psychologica, 123(1-2):144–165, 2006

work page 2006
[28]

Learning to poke by poking: Experiential learning of intuitive physics

Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. Advances in neural information processing systems, 29, 2016

work page 2016
[29]

Intuitive physics: Current research and controversies

James R Kubricht, Keith J Holyoak, and Hongjing Lu. Intuitive physics: Current research and controversies. Trends in cognitive sciences, 21(10):749–759, 2017

work page 2017
[30]

How to grow a mind: Statistics, structure, and abstraction

Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction. science, 331(6022):1279–1285, 2011

work page 2011
[31]

Intuitive physics learning in a deep-learning model inspired by developmental psychology

Luis S Piloto, Ari Weinstein, Peter Battaglia, and Matthew Botvinick. Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nature human behaviour, 6 (9):1257–1267, 2022

work page 2022
[32]

Videophy: Evaluating physical commonsense for video generation, 2024

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Y arom, Y onatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation, 2024

work page 2024
[33]

Towards world simulator: Crafting physical commonsense-based benchmark for video generation, 2024

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation, 2024

work page 2024
[34]

Physgame: Uncovering physical commonsense violations in gameplay videos

Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, and Xiaodan Liang. Physgame: Uncovering physical commonsense violations in gameplay videos. arXiv preprint arXiv:2412.01800, 2024

work page arXiv 2024
[35]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Y ogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Y ongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

LLMPhy: Parameter-Identifiable Physical Reasoning Combining Large Language Models and Physics Engines

Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Cophy: Counterfactual learning of physical dynamics

Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, and Christian Wolf. Cophy: Counterfactual learning of physical dynamics. arXiv preprint arXiv:1909.12000, 2019

work page arXiv 1909
[38]

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[39]

Esprit: Explaining solutions to physical reasoning tasks

Nazneen Fatema Rajani, Rui Zhang, Yi Chern Tan, Stephan Zheng, Jeremy Weiss, Aadit Vyas, Abhijit Gupta, Caiming Xiong, Richard Socher, and Dragomir Radev. Esprit: Explaining solutions to physical reasoning tasks. arXiv preprint arXiv:2005.00730, 2020

work page arXiv 2005
[40]

How far is video generation from world model: A physical law perspective, 2024

Bingyi Kang, Y ang Yue, Rui Lu, Zhijie Lin, Y ang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective, 2024

work page 2024
[41]

Generative physical AI in vision: A survey

Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, and Chang Xu. Generative physical AI in vision: A survey. arXiv preprint arXiv:2501.10928, 2025

work page arXiv 2025
[42]

Luma AI Team. Luma ai. https://lumalabs.ai, 2024. Generative AI platform specializing in 3D content and photorealistic modeling

work page 2024
[43]

Image quality metrics: Psnr vs

Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010. 9 Do generative video models understand physical principles?

work page 2010
[44]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

work page 2004
[45]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[46]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

work page 2018
[47]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

[48]

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503, 2024

[49]

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. arXiv preprint arXiv:2406.15252, 2024

[50]

Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, and Jia-Bin Huang. On the content bias in Fréchet video distance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7277–7288, June 2024

[51]

Gemini Team Google: Petko Georgiev and 1133 other authors. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. URL https://arxiv.org/abs/2403.05530

[52]

Richard E Nisbett and Timothy D Wilson. Telling more than we can know: Verbal reports on mental processes. Psychological review, 84(3):231, 1977

[53]

Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40:e253, 2017

[54]

Shane Storks, Qiaozi Gao, Yichi Zhang, and Joyce Chai. Tiered reasoning for intuitive physics: Toward verifiable commonsense language understanding. In Findings of Conference on Empirical Methods in Natural Language Processing (EMNLP) 2021, 2021

[55]

Luca Weihs, Amanda Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, and Aniruddha Kembhavi. Benchmarking progress to infant-level physical reasoning in AI. Transactions on Machine Learning Research, 2022

[56]

Marcel Binz and Eric Schulz. Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences, 120(6):e2218523120, 2023

[57]

Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. GRASP: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. arXiv preprint arXiv:2311.09048, 2023

[58]

Luca M Schulze Buschoff, Elif Akata, Matthias Bethge, and Eric Schulz. Visual cognition in multimodal large language models. Nature Machine Intelligence, pages 1–11, 2025

[59]

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In Proceedings of the Asian Conference on Computer Vision, pages 18–34, 2024

[60]

John Hewitt and Christopher D Manning. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, 2019

[61]

Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023

[62]

Elizabeth S Spelke. The origins of physical knowledge. Clarendon Press/Oxford University Press, 1988

[63]

Renée Baillargeon. The acquisition of physical knowledge in infancy: A summary in eight lessons. Blackwell handbook of childhood cognitive development, pages 47–83, 2002

[64]

Michele Vicovaro. Grounding intuitive physics in perceptual experience. Journal of Intelligence, 11(10):187, 2023

[65]

Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016

[66]

Jason Fischer and Bradford Z Mahon. What tool representation, intuitive physics, and action have in common: The brain’s first-person physics engine. Cognitive neuropsychology, 38(7-8):455–467, 2021

[67]

Yichen Li, YingQiao Wang, Tal Boger, Kevin A Smith, Samuel J Gershman, and Tomer D Ullman. An approximate representation of objects underlies physical reasoning. Journal of Experimental Psychology: General, 2023

[68]

Felix A Sosa, Samuel J Gershman, and Tomer D Ullman. Blending simulation and abstraction for physical reasoning. Cognition, 254:105995, 2025

[69]

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

[70]

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

[71]

Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025

Supplementary Material

Fig. 8. Illustration ...

