GEOPHYS: The Geometry of Physical Plausibility
Pith reviewed 2026-06-27 03:32 UTC · model grok-4.3
The pith
Five geometric properties of embeddings from frozen image encoders detect physically implausible videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encoders. In aggregate these properties, called GEOPHYS, correlate with human EEG responses to object-permanence violations, discriminate physically implausible videos from realistic ones at state-of-the-art rates, and function as a low-cost verifier that improves physical alignment during video generation.
What carries the argument
GEOPHYS: the aggregate of five geometric properties of per-frame embeddings from frozen image encoders.
If this is right
- GEOPHYS separates implausible from realistic videos at 98.3 percent on LikePhys and 93.3 percent on IntPhys2.
- The same signals outperform V-JEPA 2, GPT-4o, Gemini, and twelve modern video diffusion models on physics-violation detection.
- When used as a best-of-N verifier, GEOPHYS raises MAGI-1 24B performance on PhysicsIQ from 50.01 percent to 64.50 percent.
- Physical plausibility assessment becomes possible from emergent geometry in temporal features of image encoders without ad-hoc training.
Where Pith is reading between the lines
- If the geometric signals prove stable across domains, they could support lightweight physics filters inside real-time video pipelines.
- The reported EEG correlation invites direct comparison between encoder geometry and human perceptual timing on matched stimuli.
- Video generators might incorporate the five properties as an internal consistency loss rather than relying on external verifiers.
- New benchmarks that isolate each geometric property could map which aspect tracks which class of physical violation.
Load-bearing premise
The five geometric properties of per-frame embeddings from frozen image encoders implicitly capture indicators of physical plausibility without requiring task-specific training or external models.
What would settle it
Videos containing clear physical violations in which the five geometric properties show no measurable difference from those in matched realistic videos would falsify the central claim.
Figures
read the original abstract
While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encoders. In aggregate, we call them GEOPHYS. First, we show that these signals correlate with human EEG responses to two forms of object-permanence violations. Second, GEOPHYS robustly discriminates physically implausible videos from realistic ones, achieving state-of-the-art physics-violation detection: 98.3% on LikePhys and 93.3% on IntPhys2, whereas V-JEPA 2, GPT-4o, Gemini, and twelve modern video diffusion models perform near chance. Third, used as a best-of-N verifier for physical alignment during video generation, GEOPHYS lifts MAGI-1 24B from 50.01% to 64.50% on PhysicsIQ at 1.5x lower wall-clock and 4.65x lower memory than the V-JEPA 2 world-model verifier. Ultimately, GEOPHYS demonstrates that physical plausibility in videos can be assessed by leveraging the emergent geometric properties of temporal features extracted from image encoders.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GEOPHYS, consisting of five geometric properties of per-frame embeddings from frozen image encoders, which are argued to implicitly capture indicators of physical plausibility without task-specific training. Evidence includes correlation with human EEG responses to object-permanence violations, state-of-the-art discrimination of implausible videos (98.3% on LikePhys, 93.3% on IntPhys2) where baselines including V-JEPA 2, GPT-4o, Gemini, and twelve video diffusion models perform near chance, and improved best-of-N verification for video generation (lifting MAGI-1 from 50.01% to 64.50% on PhysicsIQ at reduced compute).
Significance. If the central claim holds, the result is significant: it shows physical plausibility detection can leverage emergent geometric signals in standard frozen encoders, avoiding expensive LLM judges or ad-hoc training. Credit is due for the three independent lines of evidence (EEG correlation, benchmark discrimination, and generation verifier), the parameter-free nature, and the efficiency gains (1.5x lower wall-clock, 4.65x lower memory than V-JEPA 2). This could impact video understanding and generation pipelines.
major comments (2)
- [Methods] The abstract and summary claim the five properties are defined without free parameters or fitting, but the exact definitions, formulas, and aggregation method for GEOPHYS must be specified with equations in the methods section to allow verification that they are independently derived and not circular.
- [Experiments] Table or section reporting the 98.3% / 93.3% accuracies: the evaluation protocol (e.g., how per-frame embeddings are processed into video-level scores, dataset splits, and whether any hyperparameter tuning occurred) is load-bearing for the SOTA claim and must be detailed with controls for dataset bias.
minor comments (2)
- [Abstract] The abstract refers to 'five geometric properties' without naming them (e.g., what specific geometric measures like distances, angles, or curvatures); adding names or a brief list would improve clarity.
- [EEG Experiments] The EEG correlation experiment should include the number of subjects, trial counts, and statistical measures (e.g., correlation coefficients with p-values) for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential significance of GEOPHYS. We address the two major comments below and will incorporate the requested clarifications into the revised manuscript.
read point-by-point responses
-
Referee: [Methods] The abstract and summary claim the five properties are defined without free parameters or fitting, but the exact definitions, formulas, and aggregation method for GEOPHYS must be specified with equations in the methods section to allow verification that they are independently derived and not circular.
Authors: We agree that explicit equations are necessary for full verification. The five properties are defined from first principles on the geometry of the embedding manifold (trajectory curvature, pairwise cosine distances, volume of the convex hull of consecutive embeddings, eigenvalue spread of the temporal covariance, and a normalized displacement norm), with no learned parameters or data-dependent thresholds. In the revision we will add a dedicated Methods subsection containing the precise mathematical definitions, the closed-form aggregation into the scalar GEOPHYS score, and a short proof sketch confirming independence from any fitting procedure. revision: yes
-
Referee: [Experiments] Table or section reporting the 98.3% / 93.3% accuracies: the evaluation protocol (e.g., how per-frame embeddings are processed into video-level scores, dataset splits, and whether any hyperparameter tuning occurred) is load-bearing for the SOTA claim and must be detailed with controls for dataset bias.
Authors: The current manuscript already states that no hyper-parameters are tuned and that the same fixed geometric thresholds are used across all datasets, but we accept that a more explicit protocol description is warranted. In the revision we will expand the Experiments section with: (i) the exact per-frame to video-level aggregation rule, (ii) the precise train/test splits employed for LikePhys and IntPhys2, (iii) confirmation that zero hyper-parameter search was performed, and (iv) additional bias controls (shuffled-label baselines and cross-dataset transfer results). These additions will strengthen the SOTA claim without altering any reported numbers. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper presents five geometric properties of per-frame embeddings from frozen image encoders as emergent indicators of physical plausibility. Validation proceeds via independent empirical lines: correlation with human EEG responses to object-permanence violations, 98.3% and 93.3% accuracy on LikePhys and IntPhys2 (where strong baselines fail near chance), and measurable gains as a best-of-N verifier on PhysicsIQ. No equations, definitions, or self-citations in the provided text reduce any claimed result to a fitted input, self-defined quantity, or author-prior ansatz by construction. The approach is self-contained against external benchmarks without load-bearing internal fitting or renaming.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, and Daniele De Martini. LikePhys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference.arXiv preprint arXiv:2510.11512, 2025. (Cited on pages 1, 2, 2, 2, 3, 5, 5, 5, 6, 6, 7, 7, 7, 8, and 21.)
arXiv 2025
-
[2]
Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat, and Em- manuel Dupoux. IntPhys 2: Benchmarking intuitive physics understanding in complex synthetic environments.arXiv preprint arXiv:2506.09849, 2025. (Cited on pages 1, 2, 2, 2, 3, 5, 6, 6, 7, 7, 7, 7, 8, 18, 21, and 22.)
arXiv 2025
-
[3]
Do generative video models understand physical principles?, 2025
Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles?, 2025. URL https://arxiv.org/abs/2501. 09038. (Cited on pages 1, 2, 2, 3, 5, 8, and 22.)
2025
-
[4]
The violation-of-expectation paradigm: A conceptual overview.Psychological Review, 131(3):716, 2024
Francesco Margoni, Luca Surian, and Renée Baillargeon. The violation-of-expectation paradigm: A conceptual overview.Psychological Review, 131(3):716, 2024. (Cited on pages 1 and 5.)
2024
-
[5]
Investigating looking and social looking measures as an index of infant violation of expectation.Developmental Science, 20(6):e12452, 2017
Kirsty Dunn and J Gavin Bremner. Investigating looking and social looking measures as an index of infant violation of expectation.Developmental Science, 20(6):e12452, 2017. (Cited on pages 1 and 5.)
2017
-
[6]
Electrophysiology reveals that intuitive physics guides visual tracking and working memory.Open Mind, 8: 1425–1446, 2024
Halely Balaban, Kevin A Smith, Joshua B Tenenbaum, and Tomer D Ullman. Electrophysiology reveals that intuitive physics guides visual tracking and working memory.Open Mind, 8: 1425–1446, 2024. (Cited on pages 1, 2, 5, 5, 5, 6, 6, 6, 6, 10, 20, and 21.)
2024
-
[7]
Shari Liu, Kirsten Lydic, Lingjie Mei, and Rebecca Saxe. Violations of physical and psycholog- ical expectations in the human adult brain.Imaging Neuroscience, 2:imag–2–00068, 02 2024. ISSN 2837-6056. doi: 10.1162/imag_a_00068. URL https://doi.org/10.1162/imag_a_ 00068. (Cited on pages 1, 2, 2, 2, 5, 10, and 21.)
-
[8]
Intuitive physics understanding emerges from self-supervised pretraining on natural videos, 2025
Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos, 2025. URL https://arxiv.org/abs/2502. 11831. (Cited on pages 1 and 3.)
2025
-
[9]
Peiyao Wang, Weining Wang, and Qi Li. Physcorr: Dual-reward dpo for physics-constrained text-to-video generation with automated preference selection, 2025. URL https://arxiv. org/abs/2511.03997. (Cited on page 1.)
arXiv 2025
-
[10]
Inference-time physics alignment of video generative models with latent world models
Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, and Adriana Romero- Soriano. Inference-time physics alignment of video generative models with latent world models. arXiv preprint arXiv:2601.10553, 2026. (Cited on pages 1, 2, 2, 9, 9, 9, 9, and 26.) G...
arXiv 2026
-
[11]
V-jepa 2: Self-supervised video models enable understanding, prediction and planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba, Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xia...
-
[12]
(Cited on pages 1, 2, 2, 2, 6, 7, 7, and 9.)
URL https://arxiv.org/abs/2506.09985. (Cited on pages 1, 2, 2, 2, 6, 7, 7, and 9.)
-
[13]
Travl: A recipe for making video-language models better judges of physics implausibility, 2025
Saman Motamed, Minghao Chen, Luc Van Gool, and Iro Laina. Travl: A recipe for making video-language models better judges of physics implausibility, 2025. URL https://arxiv. org/abs/2510.07550. (Cited on pages 1 and 2.)
arXiv 2025
-
[14]
Tenenbaum, and Tomer D
Kevin Smith, Lingjie Mei, Shunyu Yao, Jiajun Wu, Elizabeth Spelke, Joshua B. Tenenbaum, and Tomer D. Ullman. Modeling expectation violation in intuitive physics with coarse probabilistic object representations. InAdvances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. (Cited on pages 1 and 5.)
2019
-
[15]
Interaction networks for learning about objects, relations and physics
Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray kavukcuoglu. Interaction networks for learning about objects, relations and physics. InProceed- ings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, page 4509–4517, Red Hook, NY , USA, 2016. Curran Associates Inc. ISBN 9781510838819. (Ci...
2016
-
[16]
Galileo: Perceiving physical object properties by integrating a physics engine with deep learn- ing
Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learn- ing. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://pro...
2015
-
[17]
Lighting (in)consistency of paint by text, 2022
Hany Farid. Lighting (in)consistency of paint by text, 2022. URL https://arxiv.org/abs/ 2207.13744. (Cited on page 1.)
arXiv 2022
-
[18]
Eric Kee, James F. O’brien, and Hany Farid. Exposing photo manipulation from shading and shadows.ACM Trans. Graph., 33(5), September 2014. ISSN 0730-0301. doi: 10.1145/2629646. URLhttps://doi.org/10.1145/2629646. (Cited on page 1.)
-
[19]
Distinguishing authentic from ai-generated explosions using spatiotemporal dynamics
Sarah Barrington and Hany Farid. Distinguishing authentic from ai-generated explosions using spatiotemporal dynamics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 10659–10667, June 2026. (Cited on page 1.)
2026
-
[20]
Perspective (in)consistency of paint by text, 2022
Hany Farid. Perspective (in)consistency of paint by text, 2022. URL https://arxiv.org/ abs/2206.14617. (Cited on page 1.)
arXiv 2022
-
[21]
Learning invariance from transformation sequences.Neural Computation, 3(2): 194–200, 1991
Peter Földiák. Learning invariance from transformation sequences.Neural Computation, 3(2): 194–200, 1991. (Cited on page 2 and 2.)
1991
-
[22]
Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances.Neural Computation, 14(4):715–770, 04 2002. ISSN 0899-7667. doi: 10.1162/ 089976602317318938. URL https://doi.org/10.1162/089976602317318938. (Cited on page 2 and 2.)
-
[23]
Tangent prop: A formalism for specifying selected invariances in an adaptive network
Patrice Simard, Bernard Victorri, Yann LeCun, and John Denker. Tangent prop: A formalism for specifying selected invariances in an adaptive network. InAdvances in Neural Information Processing Systems, volume 4, pages 895–903, 1991. (Cited on page 2 and 2.)
1991
-
[24]
Learning lie groups for invariant visual perception
Rajesh Rao and Daniel Ruderman. Learning lie groups for invariant visual perception. In M. Kearns, S. Solla, and D. Cohn, editors,Advances in Neural Information Processing Systems, volume 11. MIT Press, 1998. URL https://proceedings.neurips.cc/paper_files/ paper/1998/file/277281aada22045c03945dcb2ca6f2ec-Paper.pdf. (Cited on page 2 and 2.) GEOPHYS: The Ge...
1998
-
[25]
Learning to linearize under uncertainty
Ross Goroshin, Michael Mathieu, and Yann LeCun. Learning to linearize under uncertainty
-
[26]
URLhttps://arxiv.org/abs/1506.03011. (Cited on pages 2, 2, and 4.)
-
[27]
Olivier J. Hénaff and Eero P. Simoncelli. Geodesics of learned representations.CoRR, abs/1511.06394, 2015. URL https://api.semanticscholar.org/CorpusID:2208884. (Cited on page 2.)
arXiv 2015
-
[28]
Learning predictable and robust neural representations by straightening image sequences.Advances in Neural Information Processing Systems, 37:40316–40335, 2024
Xueyan Niu, Cristina Savin, and Eero P Simoncelli. Learning predictable and robust neural representations by straightening image sequences.Advances in Neural Information Processing Systems, 37:40316–40335, 2024. (Cited on page 2 and 2.)
2024
-
[29]
Perceptual straightening of natural videos.Nature Neuroscience, 22(6):984–991, 2019
Olivier J Hénaff, Robbe LT Goris, and Eero P Simoncelli. Perceptual straightening of natural videos.Nature Neuroscience, 22(6):984–991, 2019. (Cited on pages 2, 3, 5, 18, and 18.)
2019
-
[30]
Primary visual cortex straightens natural video trajectories
Olivier J Hénaff, Yoon Bai, Julie A Charlton, Ian Nauhaus, Eero P Simoncelli, and Robbe LT Goris. Primary visual cortex straightens natural video trajectories. InNature Communications,
-
[31]
(Cited on pages 2, 2, 3, 5, and 18.)
-
[32]
Anne Harrington, Vasha DuTell, Ayush Tewari, Mark Hamilton, Simon Stent, Ruth Rosenholtz, and William T. Freeman. Exploring perceptual straightness in learned visual representations. InThe Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=4cOfD2qL6T. (Cited on page 2 and 2.)
2023
-
[33]
Zonneveld, Pascal Mettes, and Iris Groen
Anne W. Zonneveld, Pascal Mettes, and Iris Groen. Straightening of natural visual sequences in video DNNs: the role of locality and temporal coherence. InCognitive Computational Neuroscience (CCN), Amsterdam, The Netherlands, 2025. URL https://2025.ccneuro. org/poster/?id=EJMZ9Os8jG. (Cited on pages 2, 2, and 3.)
2025
-
[34]
Chang, Ashesh Rambachan, and Sendhil Mullainathan
Keyon Vafa, Peter G. Chang, Ashesh Rambachan, and Sendhil Mullainathan. What has a foun- dation model found? inductive bias reveals world models. InForty-second International Confer- ence on Machine Learning, 2025. URLhttps://openreview.net/forum?id=i9npQatSev. (Cited on pages 2 and 11.)
2025
-
[35]
Christian Internò, Jumpei Yamaguchi, Loren Amdahl-Culleton, Markus Olhofer, David Klindt, and Barbara Hammer. The observer effect in world models: Invasive adaptation corrupts latent physics.arXiv preprint arXiv:2602.12218, 2026. (Cited on pages 2 and 11.)
arXiv 2026
-
[36]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...
2024
-
[37]
Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...
Pith/arXiv arXiv 2025
-
[38]
Brain-like object recognition with high-performing shallow recurrent ANNs
Jonas Kubilius, Martin Schrimpf, Kohitij Kar, Rishi Rajalingham, Ha Hong, Najib Majaj, Elias Issa, Pouya Bashivan, Jonathan Prescott-Roy, Kailyn Schmidt, et al. Brain-like object recognition with high-performing shallow recurrent ANNs. InAdvances in Neural Information Processing Systems, 2019. (Cited on pages 2, 5, 7, 7, and 18.)
2019
-
[39]
Simulating a primary visual cortex at the front of CNNs improves robustness to image perturbations
Joel Dapello, Tiago Marques, Martin Schrimpf, Franziska Geiger, David Cox, and James J DiCarlo. Simulating a primary visual cortex at the front of CNNs improves robustness to image perturbations. InAdvances in Neural Information Processing Systems, 2020. (Cited on pages 2, 3, 5, 7, 7, and 19.) GEOPHYS: The Geometry of Physical Plausibility 14
2020
-
[40]
Eeg decoding reveals neural predictions for naturalistic material behaviors.Journal of Neuroscience, 43(29):5406–5413, 2023
Daniel Kaiser, Rico Stecher, and Katja Doerschner. Eeg decoding reveals neural predictions for naturalistic material behaviors.Journal of Neuroscience, 43(29):5406–5413, 2023. (Cited on pages 2, 5, and 21.)
2023
-
[41]
OpenAI. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774. (Cited on pages 2, 2, and 7.)
Pith/arXiv arXiv 2024
-
[42]
Gemini: A family of highly capable multimodal models, 2025
Gemini Team. Gemini: A family of highly capable multimodal models, 2025. URL https: //arxiv.org/abs/2312.11805. (Cited on pages 2, 2, 7, and 7.)
Pith/arXiv arXiv 2025
-
[43]
MAGI-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211,
Sand AI. MAGI-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211,
-
[44]
(Cited on pages 2 and 8.)
-
[45]
Bear, Elias Wang, Damian Mrowca, Felix J
Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines, 2022. URL https: //arxiv.org/abs/2106.0826...
arXiv 2022
-
[46]
IntPhys: A benchmark for visual intuitive physics reasoning
Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. IntPhys: A benchmark for visual intuitive physics reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5016–5025, 2022. (Cited on page 2.)
2022
-
[47]
Grasp: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models
Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. Grasp: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. 2024. URL https://arxiv.org/abs/2311. 09048. (Cited on page 2.)
2024
-
[48]
Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...
Pith/arXiv arXiv 2025
-
[49]
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. (Cited on pages 2, 6, and 7.)
Pith/arXiv arXiv 2024
-
[50]
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. (Cited on pages 2, 7, 7, 7, and 8.)
Pith/arXiv arXiv 2024
-
[51]
VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=AhccnBXSne. (Cited on pages 2, 2, 7, and 8.)
2022
-
[52]
Video- t1: Test-time scaling for video generation.arXiv preprint arXiv:2503.18942, 2025
Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, and Yueqi Duan. Video- t1: Test-time scaling for video generation.arXiv preprint arXiv:2503.18942, 2025. (Cited on pages 2 and 8.)
arXiv 2025
-
[53]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin GEOPHYS: The Geometry of Physical Plausibility 15 Yang, Jiaxi Yang, Jing Zhou, Jingren Zh...
Pith/arXiv arXiv 2025
-
[54]
Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation
Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, and Wenhu Chen. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In Procee...
2024
-
[55]
Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024
Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024. (Cited on page 2.)
Pith/arXiv arXiv 2024
-
[56]
Vigor: Video geometry-oriented reward for temporal generative alignment, 2026
Tengjiao Yin, Jinglei Shi, Heng Guo, and Xi Wang. Vigor: Video geometry-oriented reward for temporal generative alignment, 2026. URL https://arxiv.org/abs/2603.16271. (Cited on page 2.)
arXiv 2026
-
[57]
Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
-
[58]
Let’s verify step by step
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InProceedings of ICLR, 2024. (Cited on page 2.)
2024
-
[59]
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti- mally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314,
-
[60]
(Cited on pages 2 and 10.)
-
[61]
Le, Christopher Ré, and Azalia Mirhoseini
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024. (Cited on pages 2 and 10.)
Pith/arXiv arXiv 2024
-
[62]
A general framework for inference-time scaling and steering of diffusion models
Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. InForty-second International Conference on Machine Learning, 2025. URL https: //openreview.net/forum?id=Jp988ELppQ. (Cited on pages 2 and 10.)
2025
-
[63]
Inference-time text-to- video alignment with diffusion latent beam search
Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta. Inference-time text-to- video alignment with diffusion latent beam search. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id= c9EAmyYPOv. (Cited on page 2.)
2025
-
[64]
Verifier matters: Enhancing inference-time scaling for video diffusion models
Lorenzo Baraldi, Davide Bucciarelli, Zifan Zeng, Chongzhe Zhang, Qunli Zhang, Marcella Cornia, Lorenzo Baraldi, Feng Liu, Zheng Hu, and Rita Cucchiara. Verifier matters: Enhancing inference-time scaling for video diffusion models. In36th British Machine Vision Confer- ence 2025, BMVC 2025, Sheffield, UK, November 24-27, 2025. BMV A, 2025. URL https: //bmv...
2025
-
[65]
Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.Nature Neuroscience, 2(1):79–87,
Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.Nature Neuroscience, 2(1):79–87,
-
[66]
The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11(2):127–138, 2010
Karl Friston. The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11(2):127–138, 2010. (Cited on page 2.) GEOPHYS: The Geometry of Physical Plausibility 16
2010
-
[67]
Ai-generated video detection via perceptual straightening
Christian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, and David Klindt. Ai-generated video detection via perceptual straightening. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neural Information Processing Systems, volume 38, pages 20672–20705. Curran Asso- ciates, Inc., 20...
2025
-
[68]
Grab-3d: Detecting ai-generated videos from 3d geometric temporal consistency, 2025
Wenhan Chen, Sezer Karaoglu, and Theo Gevers. Grab-3d: Detecting ai-generated videos from 3d geometric temporal consistency, 2025. URLhttps://arxiv.org/abs/2512.13665. (Cited on page 3.)
arXiv 2025
-
[69]
Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018
Blender Online Community.Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018. URL http://www.blender. org. (Cited on pages 3 and 7.)
2018
-
[70]
Video diffusion models: A survey, 2024
Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, and Helge Ritter. Video diffusion models: A survey, 2024. URL https://arxiv.org/abs/2405.03150. (Cited on page 3.)
arXiv 2024
-
[71]
Issa, and James J
Kohitij Kar, Jonas Kubilius, Kailyn Schmidt, Elias B. Issa, and James J. DiCarlo. Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior.Nature Neuroscience, 22(6):974–983, 2019. (Cited on page 5.)
2019
-
[72]
David J. Heeger. Normalization of cell responses in cat striate cortex.Visual Neuroscience, 9 (2):181–197, 1992. (Cited on page 5.)
1992
-
[73]
Matteo Carandini and David J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13(1):51–62, 2012. (Cited on page 5.)
2012
-
[74]
Contralateral delay activity provides a neural measure of the number of representations in visual working memory.Journal of neurophysiology, 103(4):1963–1968, 2010
Akiko Ikkai, Andrew W McCollough, and Edward K V ogel. Contralateral delay activity provides a neural measure of the number of representations in visual working memory.Journal of neurophysiology, 103(4):1963–1968, 2010. (Cited on page 5.)
1963
-
[75]
Electrophysiological measures of maintaining representations in visual working memory.Cortex, 43(1):77–94, 2007
Andrew W McCollough, Maro G Machizawa, and Edward K V ogel. Electrophysiological measures of maintaining representations in visual working memory.Cortex, 43(1):77–94, 2007. (Cited on page 5.)
2007
-
[76]
Animatediff: Animate your personalized text-to-image diffusion models without specific tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Fx2SbBgcte. (Cited on page 7 and 7.)
2024
-
[77]
Cosmos world foundation model platform for physical ai, 2025
NVIDIA team. Cosmos world foundation model platform for physical ai, 2025. URL https: //arxiv.org/abs/2501.03575. (Cited on page 7.)
Pith/arXiv arXiv 2025
-
[78]
Modelscope text-to-video technical report, 2023
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report, 2023. URL https://arxiv.org/abs/2308. 06571. (Cited on page 7 and 7.)
2023
-
[79]
Mochi 1: A new SOTA in open-source video generation models
Genmo Team. Mochi 1: A new SOTA in open-source video generation models. https: //github.com/genmoai/mochi, 2024. (Cited on page 7.)
2024
-
[80]
Ltx-video: Realtime video latent diffusion, 2024
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion, 2024. URLhttps://arxiv.org/abs/2501.00103. (Cited on page 7.)
Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.