
arxiv: 2606.18943 · v1 · pith:A73MYTR6new · submitted 2026-06-17 · 💻 cs.CV

Physics-IQ Verified

Pith reviewed 2026-06-26 21:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generative models · physical understanding · benchmark audit · prompt refinement · ground-truth quality · sample-level scoring · model ranking

The pith

Refining prompts, ground-truth videos, and scoring in the Physics-IQ benchmark produces a more reliable measure of physical understanding in video generative models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a systematic audit of the Physics-IQ benchmark, which compares videos generated by models against real-world physical experiments to quantify physical understanding. It identifies shortcomings in prompt and ground-truth quality that allow confounding factors to influence results. Three targeted improvements are introduced: better prompts and ground-truth videos to reduce those factors, plus a new sample-level scoring system that gives equal weight to every sample and metric. The resulting Physics-IQ Verified benchmark revises 57.6 percent of samples and improves 34.8 percent of prompts. When six image-to-video models are re-evaluated, their rankings shift moderately but meaningfully.
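The percentages above can be cross-checked against the counts quoted in the Figure 3 caption on this page (198 videos, 69 with unclear prompts, 59 with artifacts, 20 in both groups). The sketch below is only an arithmetic check; the variable names are illustrative and nothing here comes from the benchmark's code.

```python
# Counts quoted in the Figure 3 caption of this page.
TOTAL_VIDEOS = 198
UNCLEAR_PROMPTS = 69
ARTIFACTS = 59
BOTH = 20


def pct(count: int, total: int = TOTAL_VIDEOS) -> float:
    """Share of the benchmark in percent, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Inclusion-exclusion: videos with an unclear prompt, an artifact, or both.
either_issue = UNCLEAR_PROMPTS + ARTIFACTS - BOTH

prompt_share = pct(UNCLEAR_PROMPTS)  # videos whose prompt was flagged as unclear
either_share = pct(either_issue)     # videos with at least one of the two issues

print(f"unclear prompts: {UNCLEAR_PROMPTS}/{TOTAL_VIDEOS} = {prompt_share}%")
print(f"prompt or artifact: {either_issue}/{TOTAL_VIDEOS} = {either_share}%")
```

The prompt share reproduces the 34.8 percent quoted above. The union of the two itemized groups is 108 videos, whereas 57.6 percent of 198 is about 114 videos, so the headline figure presumably counts refinements beyond these two groups; the page does not say which.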

Core claim

The paper claims that improving prompt and ground-truth quality to reduce confounding factors, combined with a sample-level scoring system that weights each sample and metric equally, makes the benchmark a more faithful signal of physical understanding. The evidence offered is that 57.6 percent of samples are refined and that the rankings of six image-to-video models change, with a Kendall's tau of 0.46.
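For readers unfamiliar with the statistic, the following is a minimal sketch of Kendall's tau for two tie-free rankings of six items. The model labels and ranks are hypothetical placeholders, not the paper's results, and the function implements the plain tau-a variant, which may differ from the variant the authors used.

```python
from itertools import combinations


def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Kendall's tau-a for two tie-free rankings of the same items."""
    concordant = discordant = 0
    for x, y in combinations(sorted(rank_a), 2):
        # A pair is concordant when both rankings order x and y the same way.
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical ranks (1 = best) for six placeholder models.
original = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
verified = {"A": 2, "B": 1, "C": 5, "D": 3, "E": 6, "F": 4}

tau = kendall_tau(original, verified)
print(f"tau = {tau:.3f}")
```

With six tie-free items there are 15 pairs, so tau-a moves in steps of 2/15. The attainable values nearest to the reported 0.46 are 7/15 (about 0.47) and 5/15 (about 0.33), which suggests the reported number is an average over resamples or a tie-corrected variant. That reading is an inference, not something this page states.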

What carries the argument

The sample-level scoring system that assigns equal weight to each sample and each metric.
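A rough sketch of what equal weighting per sample and per metric could look like is given below. It assumes every metric has already been normalised to [0, 1] with higher meaning better; the metric names, the scores, and the contrasting weighted variant are all hypothetical and are not taken from the benchmark's code.

```python
from statistics import mean

# Hypothetical per-sample scores; metric names are placeholders and every
# value is assumed to be normalised to [0, 1] with higher meaning better.
samples = [
    {"activation_a": 0.8, "activation_b": 0.6, "activation_c": 0.7, "pixel": 0.7},
    {"activation_a": 0.2, "activation_b": 0.4, "activation_c": 0.3, "pixel": 0.3},
]
# Hypothetical unequal weights (for example clip length), used only for contrast.
weights = [3.0, 1.0]


def sample_level_score(per_sample):
    """Equal weight per metric within a sample, then equal weight per sample."""
    return mean(mean(list(metrics.values())) for metrics in per_sample)


def weighted_score(per_sample, sample_weights):
    """Contrast case: samples contribute in proportion to a weight."""
    per_sample_means = [mean(list(metrics.values())) for metrics in per_sample]
    total = sum(sample_weights)
    return sum(w * s for w, s in zip(sample_weights, per_sample_means)) / total


equal = sample_level_score(samples)
unequal = weighted_score(samples, weights)
print(f"equal weighting: {equal:.2f}, weighted: {unequal:.2f}")
```

The contrast shows why the aggregation choice matters: the same per-sample scores yield a different final score as soon as some samples count more than others.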

If this is right

  • Improved prompt and ground-truth quality reduces the effect of non-physical factors on scores.
  • Equal weighting across samples and metrics produces rankings that more closely reflect physical understanding.
  • Moderate ranking changes indicate that earlier evaluations may have been distorted by confounding factors.
  • The updated benchmark supplies a clearer signal for developing video generative models that capture physical reality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models previously tuned to the original benchmark may need re-training or re-testing under the verified version.
  • Similar audit-and-refine steps could be applied to other video or world-modeling benchmarks.
  • Downstream tasks that rely on video models as world simulators may see different performance once the benchmark is updated.

Load-bearing premise

The revisions to prompts and ground-truth videos together with equal-weight sample-level scoring actually reduce confounding factors instead of merely changing which models score highest.

What would settle it

Re-evaluate the six image-to-video models on both the original and verified benchmarks using additional held-out real-world physical experiments and check whether ranking changes persist or disappear.
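One way to probe whether a ranking change persists is to resample the evaluation samples and recompute the rank agreement each time; the Figure 16 caption indicates the paper runs a bootstrap comparison of this general kind. The sketch below is an assumption-laden illustration: the scores are synthetic, the three model labels are placeholders, and the paired resampling scheme is a guess rather than the authors' procedure.

```python
import random
from itertools import combinations
from statistics import mean


def ranks(scores):
    """Map model -> rank (1 = best) from model -> mean score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two tie-free rankings of the same models."""
    signs = [
        1 if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0 else -1
        for x, y in combinations(sorted(rank_a), 2)
    ]
    return sum(signs) / len(signs)


def bootstrap_tau(scores_a, scores_b, n_boot=500, seed=0):
    """Resample sample indices with replacement and recompute tau each time.

    scores_a / scores_b: model -> list of per-sample scores, aligned by index.
    """
    rng = random.Random(seed)
    n_samples = len(next(iter(scores_a.values())))
    taus = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        mean_a = {m: mean(s[i] for i in idx) for m, s in scores_a.items()}
        mean_b = {m: mean(s[i] for i in idx) for m, s in scores_b.items()}
        taus.append(kendall_tau(ranks(mean_a), ranks(mean_b)))
    return taus


# Synthetic per-sample scores for three placeholder models; not real results.
data_rng = random.Random(1)
original = {m: [data_rng.random() for _ in range(30)] for m in ("A", "B", "C")}
verified = {m: [data_rng.random() for _ in range(30)] for m in ("A", "B", "C")}

taus = bootstrap_tau(original, verified)
print(f"mean tau over {len(taus)} resamples: {mean(taus):.2f}")
```

A tau distribution that stays well below 1 across resamples would indicate a persistent ranking difference, while one concentrated near 1 would suggest the two versions mostly agree.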

Figures

Figures reproduced from arXiv: 2606.18943 by Carsten T. Lüth, Hilde Kuehne, Priyank Jaini, Robert Geirhos, Stefan Bauer, Tim Rädsch, Yuki M Asano.

Figure 1: Key improvements from the original to the verified Physics-IQ evaluation. We propose three refinements to the original pipeline targeting: (1) prompt quality, (2) metric aggregation, and (3) spurious metric activations (artifacts). These improvements together sharpen the focus of the evaluation on physical understanding rather than confounding factors and also lead to a fine-grained understanding of the fi…
Figure 2: Examples of unclear prompt and artifact corrections in Physics-IQ Verified. (a) Unclear prompts reduce the ability of either a model or human to reliably predict the physical effect as key questions with respect to the movement are not addressed. Examples for each of the four categories in decreasing order of severity from left to right alongside our corrections. (b) Artifacts influence the binary activati…
Figure 3: Overview of dataset modifications and issue distributions across the 198 benchmark videos. Of the 198 videos, 69 contain unclear prompts and 59 contain artifacts, with 20 videos belonging to both groups. (a) Video-level overview, with flows from all videos to unclear prompts and artifacts; prompt issue categories are shown as separate counts and may overlap across videos. (b) Frame-level composition, showi…
Figure 4: Full prompt improvement showcasing correction and templater. The original prompt does not adhere to the best practices of the model providers. We address this by grouping the information contained in a prompt into six fields (each color denoting a separate field, where SETUP & SCENE are merged for this case). These fields can be used by custom templaters for each model, here visualized for Sora. The ACTION…
Figure 5: Comparison of Physics-IQ scores in its original and our proposed verified form. (a) Side-by-side comparison of final Physics-IQ scores for each model. For all models, with the exception of Wan 2.2, the scores increase for the verified evaluation. Sora 2 shows the largest increase in scores. T-denotes the standard deviations across four different runs. (b) Ranking bump plot highlighting the differences in r…
Figure 6: The Influence of Prompts and Artifacts on the resulting scores. (a) Prompts: All models with the exception of Wan 2.2 benefit from the inclusion of the best-practice prompts (bpp) over original prompts (op). Wan 2.2 is the only model for which the performance decreases. (b) Artifacts: Here denoted as original GT (with artifacts) and verified GT (without artifacts). All models show a reduction in absolute p…
Figure 7: Comparison between a generation with the original prompt and verified prompt using Wan 2.2 to generate a static rubber duck on a wooden table. Using the original prompt, a hand appears and interacts with the duck. The Best Practice Prompt explicitly states that nothing except the described phenomena occurs. Original Prompt: A stationary yellow rubber duck on a light brown wooden table against a plain…
Figure 8: Comparison between a generation with the original prompt and verified prompt using p-video to generate a rotating teapot in front of a mirror. Using the original prompt, the camera zooms in. The Best Practice Prompt explicitly states that the camera remains in position. Original Prompt: A teapot on a rotating display base that rotates clockwise in front of a mirror reflecting the teapot’s image. Stat…
Figure 9: Comparison between a generation with the original prompt and verified prompt using HunyuanV-1.5 to generate a tennis ball hitting a rubber duck. With the original prompt, there is no information regarding speed and the ball stops. The Best Practice Prompt adds a proxy for the speed of the ball. Original Prompt: A light beige coffee table with a small yellow rubber ducky on it. A mu…
Figure 10: Exemplary changes: Non-deterministic artifacts, here mainly grabber-related regions. Each column shows the first frame, the original aggregated activation map, the verified aggregated activation map, and the last frame. The grabber tools glow bright in the original activation map. However, their movement is unrelated to the physical effect: the falling objects. By removing both post-effect artifacts after…
Figure 11: Exemplary changes: Additional non-deterministic artifacts. Each column shows the first frame, the original aggregated activation map, the verified aggregated activation map, and the last frame. We use binary maps here because they better reveal smaller spatial changes and make more localized random effects easier to detect. The recording errors generate activations in the binary activation map. However, t…
Figure 12: Exemplary changes: Deterministic artifacts. Each column shows the first frame, the original aggregated activation map, the verified aggregated activation map, and the last frame. Note that we modified the improved prompt in these particular cases to stop the rotating base, once the effect has been set in motion. The rotators glow bright in the original activation map. However, their movement is unrelated …
Figure 13: Modification Overview. Tiles: Each tile represents one take-1 video from the 198-video evaluation set. Red marks activity removed after the annotated effect end; blue marks activity retained in the verified evaluation; grey indicates videos whose physical effect continues throughout the full duration. The error icons mark videos where this specific error is present in the original version.
Figure 14: Key improvements from the original to the verified Physics-IQ evaluation. (a) Overview of the Physics-IQ evaluation pipeline, where a generative model produces video continuations that are compared to a ground truth using three activation-based and one pixel-based metric, followed by aggregation into a final score. Light tile colors indicate corresponding elements of the same benchmark sample: one condit…
Figure 15: Visualization of the physical variance distribution
Figure 16: Original vs. Verified Evaluation–Ranking comparison using bootstrapping. (a) Visualization using a scatter plot, where large dots indicate the mean rank, while the smaller faint dots indicate the frequency with stronger color indicating more frequent ranks. Both the mean Spearman-ρ and Kendall-τ signal meaningful ranking differences. (b) Distributional assessment of correlation coefficients across evalua…
Figure 17: Comparison of Physics-IQ scores in their original and our proposed form. (a) Side-by-side comparison of original and verified Physics-IQ scores for each model. All models have higher scores. T-denotes the standard deviations across four different runs. (b) Ranking bump plot showing no differences in ranking. (c) Bootstrap analysis ranking scatter plot. Large dots indicate the mean rank, while the smaller …
Figure 18: Ranking comparison using bootstrapping of Physics-IQ scores in their original and our proposed form. (a) Distributional assessment of correlation coefficients across evaluations and within. Rankings match almost perfectly.
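Figures 10 to 13 describe artifacts, such as grabber tools moving after the physical effect has ended, that light up the ground-truth activation map without being part of the effect. The toy sketch below illustrates why trimming the ground truth at an annotated effect end can change an overlap score. It is not the benchmark's metric code: the frame-differencing rule, the threshold, the `end` parameter, and the tiny 2×3 frames are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

Frame = List[List[int]]


def activation_map(frames: List[Frame], threshold: int = 10,
                   end: Optional[int] = None) -> Set[Tuple[int, int]]:
    """Pixels whose intensity changes by more than `threshold` between
    consecutive frames. If `end` is given, frames after index `end` are ignored."""
    clip = frames if end is None else frames[: end + 1]
    active = set()
    for prev, curr in zip(clip, clip[1:]):
        for r, (row_prev, row_curr) in enumerate(zip(prev, curr)):
            for c, (p, q) in enumerate(zip(row_prev, row_curr)):
                if abs(q - p) > threshold:
                    active.add((r, c))
    return active


def iou(a: Set[Tuple[int, int]], b: Set[Tuple[int, int]]) -> float:
    """Intersection over union of two sets of active pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


still = [[0, 0, 0], [0, 0, 0]]
effect = [[255, 0, 0], [0, 0, 0]]      # the physical effect changes pixel (0, 0)
artifact = [[255, 0, 0], [0, 0, 255]]  # a later, unrelated change at pixel (1, 2)

ground_truth = [still, effect, effect, artifact]
generated = [still, effect, effect, effect]  # reproduces the effect only

iou_original = iou(activation_map(ground_truth), activation_map(generated))
iou_verified = iou(activation_map(ground_truth, end=1), activation_map(generated))
print(f"original GT: {iou_original:.2f}, trimmed GT: {iou_verified:.2f}")
```

In this toy case the generated clip matches the physical effect exactly, yet it is penalised against the untrimmed ground truth because of the unrelated late activation.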
Original abstract

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an audit of the Physics-IQ benchmark for measuring physical understanding in video generative models. It identifies issues with prompt and ground-truth quality that introduce confounding factors, proposes three solutions including prompt/GT revisions and a new sample-level equal-weight scoring system, reports that the updated benchmark refines 57.6% of samples and improves 34.8% of prompts, and shows moderate ranking shifts (Kendall's τ = 0.46) across six image-to-video models. The revised benchmark and code are released publicly.

Significance. If the revisions and scoring changes demonstrably reduce confounding and yield a more faithful measure of physical understanding, the work would strengthen evaluation practices for world-modeling capabilities in VGMs. The public code release is a clear strength that supports reproducibility and community follow-up.

major comments (2)
  1. [Abstract] Abstract: The central claim that the revisions produce 'a more reliable signal' rests on the reported refinements (57.6% samples, 34.8% prompts) and ranking change (Kendall's τ = 0.46), yet no derivation, criteria for identifying confounding factors, or error bars are supplied. This absence directly undermines assessment of whether the changes improve measurement fidelity.
  2. [Abstract] Comparison study (abstract): The only quantitative support offered for reduced confounding is the moderate ranking shift across six models. No external validation criterion—such as correlation with expert-labeled physics violations or held-out real-world physical accuracy—is reported, leaving open the possibility that the new metric simply reshuffles scores without increasing faithfulness.
minor comments (1)
  1. [Abstract] Abstract: The Kendall's τ value is given without the number of models or ties considered; adding this context would improve interpretability of the reported ranking changes.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on the abstract's claims. We address the two major comments below regarding justification of refinements and external validation. We will revise the abstract to better qualify our statements without overstating the results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the revisions produce 'a more reliable signal' rests on the reported refinements (57.6% samples, 34.8% prompts) and ranking change (Kendall's τ = 0.46), yet no derivation, criteria for identifying confounding factors, or error bars are supplied. This absence directly undermines assessment of whether the changes improve measurement fidelity.

    Authors: The abstract is intentionally concise, but we agree it lacks explicit criteria and error bars. Section 3 of the manuscript details the audit criteria (e.g., prompt ambiguity, mismatched ground-truth physics, and metric inconsistencies) used to identify confounding factors through systematic review. We will revise the abstract to briefly summarize these criteria and add that ranking shifts are computed across six models without claiming statistical significance via error bars, as the focus is on observed changes rather than inference. revision: yes

  2. Referee: [Abstract] Comparison study (abstract): The only quantitative support offered for reduced confounding is the moderate ranking shift across six models. No external validation criterion—such as correlation with expert-labeled physics violations or held-out real-world physical accuracy—is reported, leaving open the possibility that the new metric simply reshuffles scores without increasing faithfulness.

    Authors: We agree that the Kendall's τ = 0.46 ranking shift alone does not constitute external validation of improved fidelity to physical understanding. The manuscript does not report correlations with expert labels or held-out real-world accuracy, as this work is an audit and refinement of the existing benchmark rather than a new validation study. We will revise the abstract to remove the phrase 'more reliable signal' and instead describe the outcome as 'refined benchmark with moderate ranking shifts,' accurately reflecting the evidence provided. revision: yes

standing simulated objections not resolved
  • No external validation (e.g., expert-labeled physics violations or real-world accuracy correlation) is available in the current work to directly confirm reduced confounding.

Circularity Check

0 steps flagged

No circularity: empirical benchmark audit with direct revisions, no equations or self-referential derivations

full rationale

The paper performs a manual audit of the existing Physics-IQ benchmark and applies direct revisions to prompts, ground-truth videos, and scoring (sample-level equal weighting). It reports the fraction of samples refined (57.6%) and prompts improved (34.8%), plus an observed Kendall τ ranking shift across six models. No equations, fitted parameters, predictions derived from prior outputs, or load-bearing self-citations appear in the provided text. The central claim rests on the explicit changes made rather than any reduction to the paper's own inputs by construction, making the work self-contained as an empirical refinement effort.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on the abstract alone, the central claim rests on the domain assumption that the original benchmark contained identifiable confounding factors whose removal improves measurement validity; no free parameters or invented entities are mentioned.

axioms (1)
  • Domain assumption: The original Physics-IQ benchmark contains confounding factors in prompts and ground-truth videos that can be systematically identified and corrected.
    Invoked when the authors state they improve prompt and ground-truth quality to reduce confounding influence.



Reference graph

Works this paper leans on

53 extracted references · 13 linked inside Pith

  1. [1]

    Do generative video mod- els understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video mod- els understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

  2. [2]

    Jürgen Schmidhuber.Making the world differentiable: on using self supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments, volume 126. Inst. für Informatik, 1990

  3. [3]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62(1):1–62, 2022

  4. [4]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  5. [5]

    Dreamgen: Unlocking generalization in robot learning through video world models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. InConference on Robot Learning, pages 5170–5194. PMLR, 2025

  6. [6]

    Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328, 2025

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328, 2025

  7. [7]

    World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

  8. [8]

    FVD: A new metric for video generation, 2019

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation, 2019. URL https://openreview.net/ forum?id=rylgEULtdN

  9. [9]

    Fr \’echet video motion distance: A metric for evaluating motion consistency in videos.arXiv preprint arXiv:2407.16124, 2024

    Jiahe Liu, Youran Qu, Qi Yan, Xiaohui Zeng, Lele Wang, and Renjie Liao. Fr \’echet video motion distance: A metric for evaluating motion consistency in videos.arXiv preprint arXiv:2407.16124, 2024

  10. [10]

    A very big video reasoning suite.arXiv preprint arXiv:2602.20159, 2026

    Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, et al. A very big video reasoning suite.arXiv preprint arXiv:2602.20159, 2026

  11. [11]

    Physion: Evaluating physical prediction from vision in humans and machines.arXiv preprint arXiv:2106.08261, 2021

    Daniel M Bear, Elias Wang, Damian Mrowca, Felix J Binder, Hsiao-Yu Fish Tung, RT Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, et al. Physion: Evaluating physical prediction from vision in humans and machines.arXiv preprint arXiv:2106.08261, 2021

  12. [12]

    Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Josh Tenenbaum, Dan Yamins, Judith Fan, and Kevin Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties.Advances in Neural Information Processing Systems, 36: 67048–67068, 2023

  13. [13]

    Craft: A benchmark for causal reasoning about forces and interactions

    Tayfun Ates, M Ate¸ so˘glu, Ça ˘gatay Yi˘git, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, and Deniz Yuret. Craft: A benchmark for causal reasoning about forces and interactions. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2602–2627, 2022. 10

  14. [14]

    Intphys: A framework and benchmark for visual intuitive physics reasoning.arXiv preprint arXiv:1803.07616, 2018

    Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. Intphys: A framework and benchmark for visual intuitive physics reasoning.arXiv preprint arXiv:1803.07616, 2018

  15. [15]

    Cophy: Counterfactual learning of physical dynamics.arXiv preprint arXiv:1909.12000, 2019

    Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, and Christian Wolf. Cophy: Counterfactual learning of physical dynamics.arXiv preprint arXiv:1909.12000, 2019

  16. [16]

    Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442, 2019

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442, 2019

  17. [17]

    Esprit: Explaining solutions to physical reasoning tasks

    Nazneen Fatema Rajani, Rui Zhang, Yi Chern Tan, Stephan Zheng, Jeremy Weiss, Aadit Vyas, Abhijit Gupta, Caiming Xiong, Richard Socher, and Dragomir Radev. Esprit: Explaining solutions to physical reasoning tasks. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7906–7917, 2020

  18. [18]

    How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

  19. [19]

    Phyre: A new benchmark for physical reasoning.Advances in Neural Information Processing Systems, 32, 2019

    Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre: A new benchmark for physical reasoning.Advances in Neural Information Processing Systems, 32, 2019

  20. [20]

    Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  21. [21]

    Improving the physics of video generation with vjepa-2 reward signal.arXiv preprint arXiv:2510.21840, 2025

    Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari- Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, and Adriana Romero-Soriano. Improving the physics of video generation with vjepa-2 reward signal.arXiv preprint arXiv:2510.21840, 2025

  22. [22]

    Inference-time physics alignment of video generative models with latent world models.arXiv preprint arXiv:2601.10553, 2026

    Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari- Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, and Adriana Romero-Soriano. Inference-time physics alignment of video generative models with latent world models.arXiv preprint arXiv:2601.10553, 2026

  23. [23]

    Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

  24. [24]

    Sora 2 system card openai september 30, 2025 1, Sep 2025

    Open AI. Sora 2 system card openai september 30, 2025 1, Sep 2025. URL https://cdn.openai.com/ pdf/50d5973c-c4ff-4c2d-986f-c72b5d0ff069/sora_2_system_card.pdf

  25. [25]

    Video-gpt via next clip diffusion.arXiv preprint arXiv:2505.12489, 2025

    Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, and Yali Wang. Video-gpt via next clip diffusion.arXiv preprint arXiv:2505.12489, 2025

  26. [26]

    Bootstrapping physics-grounded video generation through vlm-guided iterative self-refinement.arXiv preprint arXiv:2511.20280, 2025

    Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, and Qingming Huang. Bootstrapping physics-grounded video generation through vlm-guided iterative self-refinement.arXiv preprint arXiv:2511.20280, 2025

  27. [27]

    Phys4d: Fine-grained physics-consistent 4d modeling from video diffusion.arXiv preprint arXiv:2603.03485, 2026

    Haoran Lu, Shang Wu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, et al. Phys4d: Fine-grained physics-consistent 4d modeling from video diffusion.arXiv preprint arXiv:2603.03485, 2026

  28. [28]

    VideoPhy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. VideoPhy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024

  29. [29]

    Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

  30. [30]

    Language models are not naysayers: an analysis of language models on negation benchmarks

    Thinh Hung Truong, Timothy Baldwin, Karin Verspoor, and Trevor Cohn. Language models are not naysayers: an analysis of language models on negation benchmarks. InProceedings of the 12th Joint Conference on Lexical and Computational Semantics (* SEM 2023), pages 101–114, 2023

  31. [31]

    This is not a dataset: A large negation benchmark to challenge large language models

    Iker García-Ferrero, Begoña Altuna, Javier Alvez, Itziar Gonzalez-Dios, and German Rigau. This is not a dataset: A large negation benchmark to challenge large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 8596–8615, 2023. 11

  32. [32]

    Valse: A task-independent benchmark for vision and language models centered on linguistic phenomena

    Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt. Valse: A task-independent benchmark for vision and language models centered on linguistic phenomena. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253–8280, 2022

  33. [33]

    Vision-language models do not understand negation

    Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip HS Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand negation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29612–29622, 2025

  34. [34]

    Relations, negations, and numbers: Looking for logic in generative text-to-image models.arXiv preprint arXiv:2411.17066, 2024

    Colin Conwell, Rupert Tawiah-Quashie, and Tomer Ullman. Relations, negations, and numbers: Looking for logic in generative text-to-image models.arXiv preprint arXiv:2411.17066, 2024

  35. [35]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, Accessed: 2026-04-29, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, Accessed: 2026-04-29, 2025

  36. [36]

    Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, Accessed: 2026-04-29, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, Accessed: 2026-04-29, 2024

  37. [37]

    Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026

    Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, et al. Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026

  38. [38]

    URL https://www.pruna.ai/

    Efficient machine learning with pruna, 2023. URL https://www.pruna.ai/. Software available from pruna.ai, Accessed: 2026-04-29

  39. [39]

    xAI. Grok Imagine API: State-of-the-art video generation across quality, cost, and latency. https://x.ai/news/grok-imagine-api, 2026. Accessed: 2026-04-29

  40. [40]

    Maurice G Kendall. The treatment of ties in ranking problems. Biometrika, 33(3):239–251, 1945

  41. [41]

    Charles Spearman. The proof and measurement of association between two things. 1961

  42. [42]

    Jacob Cohen. Statistical power analysis for the behavioral sciences, Rev. Lawrence Erlbaum Associates, Inc, 1977

  43. [43]

    Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945

  44. [44]

    Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006

  45. [45]

    Daniel M Wegner, David J Schneider, Samuel R Carter, and Teri L White. Paradoxical effects of thought suppression. Journal of Personality and Social Psychology, 53(1):5, 1987

  46. [46]

    Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18826–18836, 2025

  47. [47]

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogniti...

  48. [48]

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503, 2024

  49. [49]

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. EvalCrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  50. [50]

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

  51. [51]

    Zihan Wang, Songlin Li, Lingyan Hao, Xinyu Hu, and Bowen Song. What you see is what matters: A novel visual and physics-based metric for evaluating video generation quality. arXiv preprint arXiv:2411.13609, 2024

  52. [52]

    Chenyu Zhang, Daniil Cherniavskii, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, Andrii Zadaianchuk, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments. arXiv preprint arXiv:2504.02918, 2025
