Recognition: 2 theorem links
· Lean TheoremPhantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Pith reviewed 2026-05-10 18:11 UTC · model grok-4.3
The pith
Phantom generates videos that are both visually realistic and physically consistent by jointly modeling visuals and latent physical dynamics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phantom jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physical-aware video representation directly into the video generation process, Phantom produces video sequences that are both,visu-
What carries the argument
The physics-aware video representation, an abstract embedding of underlying physics that enables joint prediction of physical dynamics and visual content without explicit physical specifications.
If this is right
- Generated videos exhibit motion and interactions that obey physical laws rather than producing visually plausible but impossible sequences.
- The model achieves competitive perceptual quality on standard video benchmarks while improving adherence to physical dynamics on specialized tests.
- No manual definition of physical properties or equations is needed for the joint prediction process.
- The same architecture supports both ordinary video continuation and physics-aware evaluation tasks.
Where Pith is reading between the lines
- This joint latent approach could be tested on whether it improves long-horizon prediction stability compared to purely visual models.
- The method might transfer to domains like 3D scene generation or robotic simulation where physical consistency is critical.
- Further experiments could measure how well the inferred states align with measurable quantities such as velocity or mass in controlled scenes.
Load-bearing premise
A physics-aware video representation can serve as an abstract yet informative embedding of the underlying physics to facilitate joint prediction without requiring explicit specification of complex physical dynamics and properties.
What would settle it
Generate videos of simple rigid-body interactions such as elastic collisions or projectile motion and check whether measured trajectories in the output violate conservation of momentum or energy when compared to ground-truth physics.
Figures
read the original abstract
Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In his work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informaive embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physical-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Phantom, a generative video model that jointly infers latent physical dynamics and generates future frames by conditioning on observed frames and a physics-aware video representation. This representation is presented as an abstract embedding of underlying physics that enables joint prediction of dynamics and visual content without explicit physical specifications. The central claim is that the resulting videos are both perceptually realistic and physically consistent, with quantitative and qualitative superiority on standard video generation and physics-aware benchmarks.
Significance. If the joint modeling mechanism demonstrably enforces physical laws beyond data-driven correlations, the approach could meaningfully advance video synthesis by addressing the documented failure of scaled generative models to respect real-world dynamics. This would be relevant for downstream tasks requiring plausible motion, such as simulation and planning. The absence of any equations, loss terms, or verification procedures in the manuscript, however, prevents assessment of whether the result holds.
major comments (2)
- [Abstract] Abstract: the claim of outperformance on physics-aware benchmarks is unsupported because the manuscript supplies neither quantitative metrics, architecture diagrams, loss functions, nor ablation studies; without these, the central assertion that the model produces physically consistent trajectories cannot be evaluated.
- [Abstract] Abstract: the physics-aware video representation is described only as an 'abstract yet informative embedding' that 'facilitates joint prediction without requiring explicit specification'; no mechanism (constraint, loss term, or verification) is given to ensure the latents capture physical laws rather than visual correlations, which is load-bearing for the consistency claim.
minor comments (2)
- [Abstract] Abstract: 'in his work' should read 'in this work'.
- [Abstract] Abstract: 'informaive' is a typographical error for 'informative'.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments. The feedback identifies key areas where the abstract could better support the paper's claims by including more concrete details. We address each point below and will make the corresponding revisions to improve clarity and evaluability.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of outperformance on physics-aware benchmarks is unsupported because the manuscript supplies neither quantitative metrics, architecture diagrams, loss functions, nor ablation studies; without these, the central assertion that the model produces physically consistent trajectories cannot be evaluated.
Authors: We acknowledge that the abstract is high-level and does not embed specific numbers or direct references, which can make standalone evaluation challenging. The full manuscript reports quantitative metrics on physics-aware benchmarks in Tables 1 and 2, presents the architecture in Figure 1, details loss functions in Section 3.2 (Equations 3-5), and includes ablation studies in Section 4.3. To directly address the concern, we will revise the abstract to incorporate key quantitative results (e.g., specific gains on physical consistency metrics) along with explicit pointers to the relevant tables, figures, and sections. This will make the performance claims immediately verifiable from the abstract. revision: yes
-
Referee: [Abstract] Abstract: the physics-aware video representation is described only as an 'abstract yet informative embedding' that 'facilitates joint prediction without requiring explicit specification'; no mechanism (constraint, loss term, or verification) is given to ensure the latents capture physical laws rather than visual correlations, which is load-bearing for the consistency claim.
Authors: We agree the abstract description is brief and does not specify the underlying mechanism. In the full manuscript, the representation is realized through joint modeling with a dedicated latent dynamics prediction loss (Section 3.3, Equation 4) that enforces consistency between predicted physical states and observed motion, going beyond pure visual correlations. This is verified via benchmark comparisons and ablations that isolate the contribution of the physics component. We will revise the abstract to include a concise statement on the joint objective and verification approach, and we will add a brief clarifying sentence on how the loss term promotes physical consistency. revision: yes
Circularity Check
No circularity: architecture proposal with independent empirical claims
full rationale
The paper presents Phantom as a new joint-modeling architecture that infers a physics-aware video representation alongside visual generation. No equations, derivations, or load-bearing steps are shown that reduce by construction to fitted parameters, self-citations, or renamed inputs. The central claim rests on benchmark results for visual fidelity and physical consistency rather than any tautological redefinition of terms. This is a standard self-contained proposal of a generative model whose validity is intended to be assessed externally via experiments.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Latent physical properties can be inferred directly from video frames and modeled jointly with visual content
- ad hoc to paper A physics-aware video representation provides an informative abstract embedding of underlying physics
invented entities (2)
-
physics-aware video representation
no independent evidence
-
latent physical dynamics
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
dual-branch flow-matching architecture that couples a pretrained video generator with a dedicated physics branch operating in a physics-aware latent space
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Cosmos World Foundation Model Platform for Physical AI
Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,
work page internal anchor Pith review arXiv
-
[2]
Meta movie gen: Ai-powered movie generation,
Meta AI. Meta movie gen: Ai-powered movie generation,
-
[3]
Accessed: 2024-11-24. 2
2024
-
[4]
Building normalizing flows with stochastic interpolants
Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InInterna- tional Conference on Learning Representations (ICLR), 2023. 3
2023
-
[5]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self- supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025. 2, 4, 1
work page internal anchor Pith review arXiv 2025
-
[6]
Videophy: Evaluating physical commonsense for video generation
Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai- Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. InInternational Con- ference on Learning Representations (ICLR), 2025. 3, 6, 1
2025
-
[7]
Videophy-2: A challenging action-centric physical commonsense evaluation in video generation
Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Golden- berg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation. InInternational Conference on Learning Representations (ICLR), 2026. 3, 6, 1
2026
-
[8]
Lumiere: A space-time diffusion model for video generation
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. 7
2024
-
[9]
Video- jam: Joint appearance-motion representations for enhanced motion generation in video models
Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Video- jam: Joint appearance-motion representations for enhanced motion generation in video models. InInternational Confer- ence on Machine Learning (ICML). PMLR, 2025. 3
2025
-
[10]
Videocrafter2: Overcoming data limitations for high-quality video diffusion models
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7310–7320, 2024. 6, 1
2024
-
[11]
Flow matching on general geometries
Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries. InInternational Conference on Learning Representations (ICLR), 2024. 2, 3, 5
2024
-
[12]
Veo2: Our state-of-the-art video generation model,
DeepMind. Veo2: Our state-of-the-art video generation model,
-
[13]
Accessed: 2025-01-09. 2
2025
-
[14]
Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos.arXiv preprint arXiv:2502.11831, 2025. 2, 4, 1
-
[15]
Recurrent world models facilitate policy evolution.Advances in Neural Information Processing Systems (NeurIPS), 31, 2018
David Ha and J¨ urgen Schmidhuber. Recurrent world models facilitate policy evolution.Advances in Neural Information Processing Systems (NeurIPS), 31, 2018. 2
2018
-
[16]
Denoising diffu- sion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020. 2, 3
2020
-
[17]
Video diffusion models.Advances in Neural Information Processing Systems (NeurIPS), 35:8633–8646, 2022
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in Neural Information Processing Systems (NeurIPS), 35:8633–8646, 2022. 2
2022
-
[18]
Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul De- bevec, and Ziwei Liu. Vchain: Chain-of-visual-thought for rea- soning in video generation.arXiv preprint arXiv:2510.05094,
-
[19]
How far is video generation from world model: A physical law perspective
Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. InInternational Conference on Machine Learning (ICML),
-
[20]
Videopoet: A large language model for zero-shot video generation
Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birod- kar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. InIn- ternational Conference on Machine Learning (ICML), pages 25105–25124. PMLR, 2024. 7
2024
-
[21]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 3, 6, 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Boosting generative image modeling via joint image-feature synthe- sis
Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakoge- orgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthe- sis. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 3
2025
-
[23]
A path towards autonomous machine intelligence version 0.9
Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62(1):1–62, 2022. 2
2022
-
[24]
Flow matching for generative model- ing
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative model- ing. InInternational Conference on Learning Representations (ICLR), 2023. 3
2023
-
[25]
Physgen: Rigid-body physics-grounded image-to- video generation
Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to- video generation. InEuropean Conference on Computer Vision (ECCV), pages 360–378. Springer, 2024. 3
2024
-
[26]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations (ICLR), 2023. 3
2023
-
[27]
To- wards world simulator: Crafting physical commonsense-based benchmark for video generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan, Quanfeng Lu, Wenqi Shao, Kaipeng Zhang, Yu Cheng, Dianqi Li, and Ping Luo. To- wards world simulator: Crafting physical commonsense-based benchmark for video generation. InInternational Conference on Machine Learning (ICML), pages 43781–43806. PMLR,
-
[28]
Motioncraft: Physics-based zero-shot video generation.Advances in Neural Information Processing Systems (NeurIPS), 37:123155–123181, 2024
Antonio Montanaro, Luca Savant Aira, Emanuele Aiello, Diego Valsesia, and Enrico Magli. Motioncraft: Physics-based zero-shot video generation.Advances in Neural Information Processing Systems (NeurIPS), 37:123155–123181, 2024. 3
2024
-
[29]
InProceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640
Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models learn physical principles from watching videos?arXiv preprint arXiv:2501.09038, 2025. 2, 3, 6, 1
-
[30]
Openvid-1m: A large-scale high-quality dataset for text-to- video generation
Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhen- heng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to- video generation. InInternational Conference on Learning Representations (ICLR), 2025. 5
2025
-
[31]
Sora: Openai’s multimodal agent, 2024
OpenAI. Sora: Openai’s multimodal agent, 2024. Accessed: 2024-11-24. 2, 3
2024
-
[32]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 2
2023
-
[33]
Wenxu Qian, Chaoyue Wang, Hou Peng, Zhiyu Tan, Hao Li, and Anxiang Zeng. Rdpo: Real data preference optimization for physics consistency video generation.arXiv preprint arXiv:2506.18655, 2025. 7
-
[34]
Runway: Platform for AI-powered video editing and generative media creation
Runway Team. Runway: Platform for AI-powered video editing and generative media creation. https://runwayml. com, 2024. Accessed: 2025-05-12. 7
2024
-
[35]
Egoforge: Goal-directed egocen- tric world simulator.arXiv preprint arXiv:2603.20169, 2026
Yifan Shen, Jiateng Liu, Xinzhuo Li, Yuanzhe Liu, Bingx- uan Li, Houze Yang, Wenqi Jia, Yijiang Li, Tianjiao Yu, James Matthew Rehg, et al. Egoforge: Goal-directed egocen- tric world simulator.arXiv preprint arXiv:2603.20169, 2026. 2
-
[36]
Denoising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. InInternational Conference on Learning Representations (ICLR), 2021. 3
2021
-
[37]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations (ICLR), 2021. 3
2021
-
[38]
Pyratok: Language-aligned pyramidal tokenizer for video understanding and generation
Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, and Ismini Lourentzou. Pyratok: Language-aligned pyramidal tokenizer for video understanding and generation. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026. 2
2026
-
[39]
Attention is all you need.Advances in Neural Information Processing Systems (NeurIPS), 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems (NeurIPS), 30, 2017. 2
2017
-
[40]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pan- deng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
WISA: World simulator assistant for physics-aware text-to-video genera- tion
Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Jiasong Feng, Zhanjie Zhang, Wanyuan Pang, and Xiaodan Liang. WISA: World simulator assistant for physics-aware text-to-video genera- tion. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 6, 1
2025
-
[42]
Lavie: High-quality video generation with cascaded latent diffusion models.International Journal on Computer Vision (IJCV), 133(5):3059–3078, 2025
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models.International Journal on Computer Vision (IJCV), 133(5):3059–3078, 2025. 6, 1
2025
-
[43]
Physanimator: Physics-guided generative cartoon animation
Tianyi Xie, Yiwei Zhao, Ying Jiang, and Chenfanfu Jiang. Physanimator: Physics-guided generative cartoon animation. InIEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 10793–10804, 2025. 3
2025
-
[44]
Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation
Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 18826–18836,
-
[45]
Cogvideox: Text-to-video dif- fusion models with an expert transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video dif- fusion models with an expert transformer. InInternational Conference on Learning Representations (ICLR), 2025. 6, 7, 1
2025
-
[46]
Representa- tion alignment for generation: Training diffusion transformers is easier than you think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representa- tion alignment for generation: Training diffusion transformers is easier than you think. InInternational Conference on Learning Representations (ICLR), 2025. 3
2025
-
[47]
Ke Zhang, Cihan Xiao, Yiqun Mei, Jiacong Xu, and Vishal M Patel. Think before you diffuse: Llms-guided physics-aware video generation.arXiv preprint arXiv:2505.21653, 2025. 3
-
[48]
VideoREPA: Learning physics for video generation through relational alignment with foundation models
Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. VideoREPA: Learning physics for video generation through relational alignment with foundation models. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 3, 6, 1, 2
2025
-
[49]
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei- Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025. 5, 6, 1 Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics Sup...
work page internal anchor Pith review arXiv 2025
-
[50]
Since most physics-focused baselines operate solely in the text-to-video setting, Figure 5 comparesPhantomonly with general-purpose T2V models. D. Physics-based Video Control To further evaluate the ability ofPhantomto model and re- spond to explicit physical control signals, we apply our frame- work to the Force-Prompting dataset 1. Force-Prompting provi...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.