pith. sign in

arxiv: 2606.04737 · v1 · pith:CSARCS2Dnew · submitted 2026-06-03 · 💻 cs.CV

Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

Pith reviewed 2026-06-28 07:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generationphysics-informedmixture-of-expertslatent alignmentflow matchingphysical plausibilityadapter traininganchored field estimation
0
0 comments X

The pith

PILA aligns frozen video model latents to a physics attribute bank via anchored motion estimates and mixture-of-experts routing to raise physical plausibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large video generation models produce visually coherent output but still generate motion that violates real-world physical regularities because their training optimizes pixel-level fitting rather than dynamics. PILA addresses this by first estimating a bank of physical attributes from the model's internal latents, using visible motion as an anchor to infer harder-to-observe quantities such as forces or material properties. A mixture-of-experts module then routes these estimates to category-specific experts whose outputs are constrained by residuals drawn from physical relations before the whole bank is decoded into a corrective term for the model's flow-matching vector field. The adapter is trained in stages on a smaller backbone and transferred directly to a much larger one, yielding measurable gains on both visual and physics-specific benchmarks while leaving the original model weights untouched.

Core claim

PILA injects physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models by mapping latents into an operational physical attribute bank through anchored field estimation, applying label-prior masked expert routing over physical categories, and decoding the refined proxies as a correction to the vector field.

What carries the argument

Anchored field estimation that converts frozen-generator latents into a physical attribute bank organized by field-proxy slots, with observable motion serving as the kinematic anchor for constructing the remaining proxies.

If this is right

  • Staged adapter training on the 1.3B Wan 2.1 model transfers directly to the 14B Wan 2.2 model without further backbone updates.
  • Label-prior masked expert routing selects category-specific operator experts whose refinements are regularized by operational residuals from physical relations.
  • The fused physical attribute bank is decoded into an additive correction to the flow-matching vector field.
  • The separation of visual prior and physics correction preserves the pretrained model's semantic and visual capabilities while adding plausibility.
  • Benchmark gains appear simultaneously on VBench-2.0 for visual quality and on VideoPhy-2 plus PhyGenBench for physical plausibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-to-proxy mapping could be applied to image or 3D generative models where physical consistency is required.
  • Expanding the attribute bank to include interaction-level proxies such as contact forces might reduce artifacts in multi-object scenes.
  • If the routing remains efficient at larger scale, the approach offers a modular way to update physics knowledge independently of visual training.
  • The method's reliance on a fixed set of physical categories suggests testing whether new categories can be added without retraining the entire router.

Load-bearing premise

Observable motion can serve as a reliable kinematic anchor to construct accurate estimates of less directly observed physical proxies that are then organized into a usable attribute bank.

What would settle it

A controlled test set of generated videos where the frequency of physical-law violations (measured by the same PhyGenBench or VideoPhy-2 metrics) is statistically higher with the PILA adapter than without it would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.04737 by Cong Wang, Hanxin Zhu, Jiayi Luo, Long Chen, Peiyan Tu, Xiaoqian Cheng, Xin Jin, Yonglin Tian, Zhibo Chen.

Figure 1
Figure 1. Figure 1: Standard video generation vs. PILA. Standard generators may produce physically implausible dynamics. PILA introduces a physics-structured latent adapter with routed experts and flow correction to improve physical plausibility. physical realism through implicit inductive biases in data-driven video generation. Existing methods have explored external physical reasoning and planning to guide synthesis toward … view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline of PILA. The method proceeds through three stages: (1) operational physical interface encoding and label-prior expert routing (Sec. 3.1); (2) physics-structured operator refinement of category-specific field-proxy representations (Sec. 3.2); and (3) physics-to-latent flow alignment through a lightweight correction to the frozen flow-matching vector field (Sec. 3.3). 3.1 Physical Interface Encoding… view at source ↗
Figure 3
Figure 3. Figure 3: Training data distribution across physical categories. We relabel WISA-80K into eight physical categories, grouped into four material categories and four phenomenon categories [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison with baselines. PILA better captures physically grounded interactions and state changes across diverse phenomena [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Quality-efficiency trade-off with different numbers of routed experts. Orange markers denote the default k = 4 setting. including Mechanics, Optics, Thermal, and Material. In addition to physics evaluation, we include the Imaging Quality (Imaging) metric from VBench [45] to assess visual quality. 4.2 Implementation Details We train PILA on top of Wan 2.1-T2V-1.3B [16]. The pretrained video generator is fro… view at source ↗
Figure 6
Figure 6. Figure 6: Analysis of the expert router. r represents the Pearson correlation coefficient calculated between different distributions. An apple falls into a vat of cider, sending up a spray. Base Generator Physics Correction PILA Output Fluid 0.56 Rigid 0.16 Phase 0.14 Compressible 0.13 Routed Experts A small piece of copper is ignited, producing a vivid green-blue flame. Thermal 0.34 Optical 0.27 Fluid 0.22 Rigid 0.… view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of routed expert activations. Expert maps show that PILA applies category￾specific updates to physically active regions. the AFE module further shows that stable physical modeling requires a dedicated latent-to-physics feature interface. More detailed explanation and additional ablations are listed in Appendix Sec. C.1. Quality-Efficiency Trade-off Study [PITH_FULL_IMAGE:figures/full_fig_p00… view at source ↗
Figure 8
Figure 8. Figure 8: Prompt-refinement template used before LPMER routing. The template rewrites a raw input prompt into a concise physics-oriented description that emphasizes objects, motion, interactions, material cues, and state changes. This refined description is used as cleaner prompt evidence for the subsequent category-labeling and label-prior routing steps. Prompt: {refined_prompt} User Prompt You classify video promp… view at source ↗
Figure 9
Figure 9. Figure 9: LLM-based category-label estimation template for LPMER. Given the refined physics￾oriented description, the template asks the LLM to select the relevant physical categories from the eight-category expert set P. The resulting multi-label set Y is used as label-prior evidence for routing. where τ is the routing temperature. In the default setting, k = 4, which keeps the physics branch sparse while allowing m… view at source ↗
Figure 10
Figure 10. Figure 10: Effect of the denoising-step schedule. Gains are computed relative to the no-adapter baseline in [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative comparison in the 1.3B–5B setting. PILA produces more physically plausible motion and interactions under the same prompts. 13 [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative comparison in the high-capacity setting. PILA produces more plausible physical evolution under the same prompts. 14 [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Additional PILA generations across physical categories. Each row shows sampled frames from one physical category in our eight-category taxonomy. 15 [PITH_FULL_IMAGE:figures/full_fig_p027_13.png] view at source ↗
read the original abstract

Large-scale video generation models have made remarkable progress in semantic consistency and visual quality, producing videos that are increasingly coherent and visually convincing. Nevertheless, the dynamics induced by pixel-level fitting do not naturally accommodate the regularities that govern real-world motion and interaction, resulting in persistent shortcomings in physical plausibility. To address this limitation, we propose \textbf{PILA} (Physics-Informed Latent Alignment), a framework that injects physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models. Specifically, PILA first employs anchored field estimation to map frozen-generator latents into an operational physical attribute bank organized by field-proxy slots, using observable motion as a kinematic anchor for constructing less directly observed proxies. To handle the heterogeneity of real-world dynamics, PILA adopts a mixture-of-experts design over physical categories. Label-prior masked expert routing selects category-specific operator experts, whose refinements are regularized by operational residuals abstracted from physical relations. Finally, the refined proxies are fused into the physical attribute bank and decoded into a correction to the flow-matching vector field, injecting physics-aware guidance while preserving the visual prior of the pretrained backbone. With staged adapter training on Wan 2.1-1.3B and direct transfer of the learned adapter to Wan 2.2-14B, PILA achieves state-of-the-art results on VBench-2.0, VideoPhy-2, and PhyGenBench in both visual quality and benchmark-measured physical plausibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PILA, a framework for injecting physics-structured guidance into the frozen flow-matching dynamics of pretrained video generators. It first applies anchored field estimation to map generator latents into a physical attribute bank organized by field-proxy slots, using observable motion as the kinematic anchor for less-observed proxies. A mixture-of-experts architecture with label-prior masked routing then refines category-specific operators, whose outputs are regularized by operational residuals derived from physical relations before being decoded as a correction to the flow-matching vector field. Staged adapter training is performed on Wan 2.1-1.3B with direct transfer to Wan 2.2-14B, yielding claimed state-of-the-art results on VBench-2.0, VideoPhy-2, and PhyGenBench for both visual quality and physical plausibility.

Significance. If the anchored proxies prove accurate and the MoE regularization demonstrably improves physical consistency without harming visual fidelity, the work would offer a scalable route to physics-aware video generation that preserves the visual prior of large backbones. The direct transfer from 1.3B to 14B adapters is a practical strength that could generalize to other generative settings. The absence of any reported validation for the proxy-construction step, however, leaves the claimed benchmark gains dependent on an unverified inverse-problem component.

major comments (2)
  1. [Abstract] Abstract (anchored field estimation paragraph): observable motion is asserted to serve as a sufficient kinematic anchor for constructing less-observed physical proxies that populate the attribute bank; no quantitative validation, ablation, or error metric on proxy accuracy is supplied, yet every downstream component (label-prior masked routing, operational-residual regularization, and vector-field correction) operates directly on this bank.
  2. [Abstract] Abstract (results claim): the assertion of SOTA performance on VideoPhy-2 and PhyGenBench after transfer supplies neither ablations, error bars, nor details on how the physical residuals are computed or validated against ground-truth dynamics, rendering the central empirical claim unverifiable from the provided text.
minor comments (1)
  1. [Abstract] The abstract introduces several compound terms ("operational-residual regularization," "field-proxy slots") without inline mathematical definitions or forward references to the sections where they are formalized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and commit to revisions that improve the verifiability of the proxy-construction step and the empirical claims without altering the core technical contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract (anchored field estimation paragraph): observable motion is asserted to serve as a sufficient kinematic anchor for constructing less-observed physical proxies that populate the attribute bank; no quantitative validation, ablation, or error metric on proxy accuracy is supplied, yet every downstream component (label-prior masked routing, operational-residual regularization, and vector-field correction) operates directly on this bank.

    Authors: We agree that the abstract and main text would benefit from explicit quantitative validation of the anchored field estimation. The manuscript demonstrates end-to-end improvements on the target benchmarks, but does not isolate proxy accuracy with dedicated error metrics or ablations against ground-truth kinematics. We will add a new subsection (and corresponding table) reporting proxy reconstruction error, sensitivity to the motion anchor, and an ablation removing the anchor, using available synthetic and real sequences with known physical attributes. This will be placed before the MoE routing experiments. revision: yes

  2. Referee: [Abstract] Abstract (results claim): the assertion of SOTA performance on VideoPhy-2 and PhyGenBench after transfer supplies neither ablations, error bars, nor details on how the physical residuals are computed or validated against ground-truth dynamics, rendering the central empirical claim unverifiable from the provided text.

    Authors: We acknowledge the concern. The current manuscript reports aggregate benchmark scores and a direct 1.3B-to-14B transfer result, but the abstract and results section lack per-metric error bars, component ablations isolating the operational residuals, and an explicit description of residual computation/validation. We will expand the experiments section with (i) error bars over multiple seeds, (ii) an ablation table removing or replacing the residual regularization, and (iii) a methods paragraph detailing the residual formulation together with a small-scale validation against ground-truth dynamics on controlled sequences. These additions will make the SOTA claims directly verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external physical relations and observable inputs without self-referential reduction

full rationale

The paper's core procedure maps latents via anchored field estimation using observable motion as kinematic anchor, then applies MoE routing and operational-residual regularization drawn from physical relations before decoding a flow-matching correction. No quoted equations, parameter fits, or self-citations reduce any claimed prediction or proxy to a quantity defined solely by the paper's own outputs or prior author work; the attribute bank construction and guidance injection remain independent of the final benchmark scores. The derivation chain is therefore self-contained against external physical priors rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the premise that a physical attribute bank can be populated from generator latents using motion as anchor and that category-specific experts can be selected and regularized without introducing new fitted constants beyond standard training.

axioms (1)
  • domain assumption Frozen flow-matching dynamics of pretrained video models can accept additive corrections derived from external physical relations without destabilizing generation.
    Invoked in the description of decoding refined proxies into a correction to the flow-matching vector field.
invented entities (1)
  • physical attribute bank organized by field-proxy slots no independent evidence
    purpose: Intermediate representation that maps generator latents to observable and inferred physical quantities
    Introduced as the target of anchored field estimation; no independent falsifiable handle supplied in abstract.

pith-pipeline@v0.9.1-grok · 5818 in / 1455 out tokens · 22278 ms · 2026-06-28T07:00:50.978341+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 12 linked inside Pith

  1. [1]

    Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

  2. [2]

    Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  4. [4]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. InThe Thirteenth International Conference on Learning Representations, 2025

  5. [5]

    Physgaussian: Physics-integrated 3d gaussians for generative dynamics

    Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. Physgaussian: Physics-integrated 3d gaussians for generative dynamics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4389–4398, 2024

  6. [6]

    Physgen3d: Crafting a miniature interactive world from a single image

    Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang. Physgen3d: Crafting a miniature interactive world from a single image. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6178–6189, 2025

  7. [7]

    Physdreamer: Physics-based interaction with 3d objects via video generation

    Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. In European Conference on Computer Vision, pages 388–406, 2024

  8. [8]

    Won- derplay: Dynamic 3d scene generation from a single image and actions.arXiv preprint arXiv:2505.18151, 2025

    Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, and Jiajun Wu. Won- derplay: Dynamic 3d scene generation from a single image and actions.arXiv preprint arXiv:2505.18151, 2025

  9. [9]

    Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation

    Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18826–18836, 2025

  10. [10]

    Phyrpr: Training-free physics-constrained video generation.arXiv preprint arXiv:2601.09255, 2026

    Yibo Zhao, Hengjia Li, Xiaofei He, and Boxi Wu. Phyrpr: Training-free physics-constrained video generation.arXiv preprint arXiv:2601.09255, 2026

  11. [11]

    Vlipp: Towards physically plausible video generation with vision and language informed physical prior

    Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, et al. Vlipp: Towards physically plausible video generation with vision and language informed physical prior. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12360–12370, 2025

  12. [12]

    Inference-time physics alignment of video generative models with latent world models.arXiv preprint arXiv:2601.10553, 2026

    Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari- Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, and Adriana Romero-Soriano. Inference-time physics alignment of video generative models with latent world models.arXiv preprint arXiv:2601.10553, 2026

  13. [13]

    Phantom: Physics-infused video generation via joint modeling of visual and latent physical dynamics.arXiv preprint arXiv:2604.08503, 2026

    Ying Shen, Jerry Xiong, Tianjiao Yu, and Ismini Lourentzou. Phantom: Physics-infused video generation via joint modeling of visual and latent physical dynamics.arXiv preprint arXiv:2604.08503, 2026

  14. [14]

    Wisa: World simulator assistant for physics-aware text-to-video generation

    Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, et al. Wisa: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153, 2025

  15. [15]

    Prophy: Progressive physical alignment for dynamic world simulation.arXiv preprint arXiv:2512.05564, 2025

    Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, and Xiaodan Liang. Prophy: Progressive physical alignment for dynamic world simulation.arXiv preprint arXiv:2512.05564, 2025

  16. [16]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  17. [17]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 10

  18. [18]

    Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

  19. [19]

    Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis

    Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, et al. Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8207–8216, 2024

  20. [20]

    Worlddreamer: Towards general world models for video generation via predicting masked tokens.arXiv preprint arXiv:2401.09985, 2024

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Boyuan Wang, Xinze Chen, and Jiwen Lu. Worlddreamer: Towards general world models for video generation via predicting masked tokens.arXiv preprint arXiv:2401.09985, 2024

  21. [21]

    V oyager: Long-range and world-consistent video diffusion for explorable 3d scene generation.ACM Transactions on Graphics (TOG), 44(6):1–15, 2025

    Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson Lau, Wangmeng Zuo, et al. V oyager: Long-range and world-consistent video diffusion for explorable 3d scene generation.ACM Transactions on Graphics (TOG), 44(6):1–15, 2025

  22. [22]

    Make it move: controllable image-to-video generation with text descriptions

    Yaosi Hu, Chong Luo, and Zhenzhong Chen. Make it move: controllable image-to-video generation with text descriptions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18219–18228, 2022

  23. [23]

    Panacea: Panoramic and controllable video generation for autonomous driving

    Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6902–6912, 2024

  24. [24]

    Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling

    Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  25. [25]

    Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024

  26. [26]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012

  27. [27]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

  28. [28]

    Vidgen-1m: A large-scale dataset for text-to-video generation.arXiv preprint arXiv:2408.02629, 2024

    Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. Vidgen-1m: A large-scale dataset for text-to-video generation.arXiv preprint arXiv:2408.02629, 2024

  29. [29]

    Physgen: Rigid-body physics- grounded image-to-video generation

    Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics- grounded image-to-video generation. InEuropean Conference on Computer Vision, pages 360–378, 2024

  30. [30]

    Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation.arXiv preprint arXiv:2501.18982, 2025

    Yuchen Lin, Chenguo Lin, Jianjin Xu, and Yadong Mu. Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation.arXiv preprint arXiv:2501.18982, 2025

  31. [31]

    What about gravity in video generation? post-training newton’s laws with verifiable rewards.arXiv preprint arXiv:2512.00425, 2025

    Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, and Dimitris Samaras. What about gravity in video generation? post-training newton’s laws with verifiable rewards.arXiv preprint arXiv:2512.00425, 2025

  32. [32]

    Realwonder: Real-time physical action-conditioned video generation.arXiv preprint arXiv:2603.05449, 2026

    Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, and Jiajun Wu. Realwonder: Real-time physical action-conditioned video generation.arXiv preprint arXiv:2603.05449, 2026

  33. [33]

    Neuma: Neural material adaptor for visual grounding of intrinsic dynamics.Advances in Neural Information Processing Systems, 37:65643–65669, 2024

    Junyi Cao, Shanyan Guan, Yanhao Ge, Wei Li, Xiaokang Yang, and Chao Ma. Neuma: Neural material adaptor for visual grounding of intrinsic dynamics.Advances in Neural Information Processing Systems, 37:65643–65669, 2024

  34. [34]

    Newton- gen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics.arXiv preprint arXiv:2509.21309, 2025

    Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, and Stanley H Chan. Newton- gen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics.arXiv preprint arXiv:2509.21309, 2025. 11

  35. [35]

    Phys4dgen: Physics- compliant 4d generation with multi-material composition perception

    Jiajing Lin, Zhenzhong Wang, Dejun Xu, Shu Jiang, Yunpeng Gong, and Min Jiang. Phys4dgen: Physics- compliant 4d generation with multi-material composition perception. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10398–10407, 2025

  36. [36]

    Physical simulator in-the-loop video generation.arXiv preprint arXiv:2603.06408, 2026

    Lin Geng Foo, Mark He Huang, Alexandros Lattas, Stylianos Moschoglou, Thabo Beeler, and Christian Theobalt. Physical simulator in-the-loop video generation.arXiv preprint arXiv:2603.06408, 2026

  37. [37]

    Cp4d: Composi- tional physics-aware 4d scene generation

    Hanxin Zhu, Cong Wang, Tianyu He, Long Chen, Xin Jin, Chen Gao, and Zhibo Chen. Cp4d: Composi- tional physics-aware 4d scene generation

  38. [38]

    Learning physics-grounded 4d dynamics with neural gaussian force fields

    Shiqian Li, Ruihong Shen, Junfeng Ni, Chang Pan, Chi Zhang, and Yixin Zhu. Learning physics-grounded 4d dynamics with neural gaussian force fields. InThe F ourteenth International Conference on Learning Representations, 2026

  39. [39]

    Chain of event-centric causal thought for physically plausible video generation.arXiv preprint arXiv:2603.09094, 2026

    Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu, Wen Li, and Yinjie Lei. Chain of event-centric causal thought for physically plausible video generation.arXiv preprint arXiv:2603.09094, 2026

  40. [40]

    Videorepa: Learning physics for video generation through relational alignment with foundation models

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. Videorepa: Learning physics for video generation through relational alignment with foundation models. arXiv preprint arXiv:2505.23656, 2025

  41. [41]

    Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation.arXiv preprint arXiv:2512.24551, 2025

    Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, et al. Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation.arXiv preprint arXiv:2512.24551, 2025

  42. [42]

    Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024

  43. [43]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

  44. [44]

    Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

  45. [45]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  46. [46]

    Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

  47. [47]

    Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  48. [48]

    refined_prompt

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. InSC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020. 12 A Additional Method Details A.1 AFE Field Construction This section details how AFE converts th...

  49. [49]

    The residual ∂tdk −u k comes directly from the displacement–velocity relation ∂td=u

    Rigid Body.The rigid-body expert is derived from rigid kinematics rather than from a deformable material model. The residual ∂tdk −u k comes directly from the displacement–velocity relation ∂td=u . A rigid body should have negligible internal strain, so εk acts as a low-strain prior and σk −γ σεk keeps the stress-like proxy compatible with that regime up ...

  50. [50]

    Elastic.The elastic expert abstracts the small-deformation wave behavior implied by linear elasticity. In a homogeneous medium, the displacement field satisfies a wave-like equation after simplifying the elastodynamic balance; this motivates ∂2 t dk −c 2 d∆dk, and the same propagation bias is applied to the velocity slot through ∂2 t uk −c 2 u∆uk. The ter...

  51. [51]

    The continuity residual ∂tρk +∇·(ρ kuk) is the conservative form of mass conservation

    Fluid.The fluid expert follows the standard mass and momentum balances used in viscous flow. The continuity residual ∂tρk +∇·(ρ kuk) is the conservative form of mass conservation. The momentum residual is a Navier–Stokes-style balance containing temporal acceleration, convective transport, pressure forcing, and viscous diffusion. Since pressure and densit...

  52. [52]

    The continuity term again comes from ∂tρ+∇ ·(ρu) = 0

    Compressible Flow.The compressible-flow expert keeps mass conservation but replaces the incompressible-style momentum emphasis with terms that expose expansion and compression. The continuity term again comes from ∂tρ+∇ ·(ρu) = 0 . The residual ∂t(∇ ·u k)−c p∆pk tracks temporal changes of local compression and couples them to pressure-like variation under...

  53. [53]

    Phase Change.The phase-change expert is motivated by transported phase indicators and Stefan-style thermal coupling. The phase slot αk is treated as an occupancy/support proxy, so ∂tαk +∇·(α kuk) links support change to local motion, standing in for an advective phase-balance equation with unresolved source terms. The combined thermal term∂tTk +u k·∇T k −...

  54. [54]

    The kinematic term ∂tdk −u k again enforces the relation between displacement and velocity

    Collision.The collision expert encodes impulse-response structure rather than solving comple- mentarity contact conditions. The kinematic term ∂tdk −u k again enforces the relation between displacement and velocity. The impulse term ∂tuk −γ j jk is derived from the impulse-momentum relation m ∂tu≃j after absorbing mass and scale into the latent proxy. The...

  55. [55]

    The term ∂tTk +∇·(u k ¯Tk)− κ∆Tk combines temporal storage, conservative advection, and diffusion, while ∇Tk limits isolated temperature spikes

    Thermal.The thermal expert uses a compact heat-transport proxy. The term ∂tTk +∇·(u k ¯Tk)− κ∆Tk combines temporal storage, conservative advection, and diffusion, while ∇Tk limits isolated temperature spikes. We omit unknown heat sources, material heat capacity, and boundary fluxes, so the resulting residual is a low-frequency thermal prior rather than a ...

  56. [56]

    w/o Expert Routing

    Optical.The optical expert abstracts wave-like propagation into a scalar latent proxy. The residual ∂2 t ψk −c 2 ψ∆ψk follows the scalar wave equation obtained after suppressing constants and polarization details. The support-gating term (1−α k)ψk discourages wave-like activation outside the associated support proxy, while ∆αk smooths that support. These ...