arxiv: 2606.26916 · v1 · pith:RWZ3LCCTnew · submitted 2026-06-25 · 💻 cs.CV

PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation

Pith reviewed 2026-06-26 05:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · retrieval-augmented generation · physics-aware generation · video diffusion models · physical rule compliance · data filtering pipeline

The pith

PhysRAG retrieves examples from a curated 7K-video physics database to correct diffusion-model violations of physical rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that retrieval-augmented generation can make video diffusion models follow physical laws more reliably by pulling relevant real-world motion examples at inference time. Current models often produce impossible dynamics because training data lacks sufficient physical variety and because the generative process itself does not enforce conservation or interaction rules. By first filtering the WISA-80K collection down to 7K high-fidelity clips and then injecting retrieved knowledge through learnable queries, the method claims measurable gains on both visual quality and explicit physical-compliance benchmarks. If the approach works, generated videos would exhibit fewer floating objects, incorrect collisions, or impossible thermal flows without any change to the underlying diffusion architecture.
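
The filtering step can be sketched as follows, using the two stages that the Figure 2 caption describes (a caption-only ranking that keeps the top 10%, then a key-frame consistency check). This is a minimal sketch under stated assumptions, not the authors' pipeline: the `Clip` fields, the `two_stage_filter` name, and the idea that each stage reduces to a precomputed score or verdict are all hypothetical, and the Qwen3-VL calls that would produce those values are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    clip_id: str
    caption: str
    caption_score: float     # hypothetical stage-1 score from a caption-only judge
    frames_consistent: bool  # hypothetical stage-2 verdict from key frames + caption


def two_stage_filter(clips: List[Clip], top_fraction: float = 0.10) -> List[Clip]:
    """Keep the best `top_fraction` by caption score, then drop clips failing the frame check."""
    if not clips:
        return []
    keep_n = max(1, round(len(clips) * top_fraction))
    # Stage 1: rank by the caption-only score and keep the best fraction.
    ranked = sorted(clips, key=lambda c: c.caption_score, reverse=True)
    candidates = ranked[:keep_n]
    # Stage 2: retain only candidates whose key frames agree with the caption.
    return [c for c in candidates if c.frames_consistent]
```

Assuming the collection holds roughly 80,000 clips (inferred from the dataset's name only), a 10% first-stage cut leaves about 8,000 candidates, and retaining 90% of those gives about 7,200, which is consistent with the stated ∼7K.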

Core claim

PhysRAG constructs a physical video database from a two-stage filtered subset of WISA-80K, then uses a retrieval mechanism with learnable queries to inject physical knowledge directly into a video diffusion model, producing state-of-the-art results on PhyGenBench and VBench for both visual fidelity and adherence to mechanics, optics, and thermal dynamics.

What carries the argument

The RAG pipeline that retrieves from the 7K-video physical database and injects the information via learnable queries into the diffusion process.
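
A rough sketch of the two moving parts, retrieval and query-based injection, may help fix ideas. Everything here is an assumption made for illustration: the page does not state the similarity metric or the internals of the injection module, so the sketch uses cosine similarity for retrieval and single-head dot-product attention with no learned projections. The function names are hypothetical, and in a real model the query vectors would be trainable parameters rather than fixed lists.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(prompt_emb: Vector, db_embs: List[Vector], k: int) -> List[int]:
    """Indices of the k database entries most similar to the prompt embedding."""
    order = sorted(range(len(db_embs)),
                   key=lambda i: cosine(prompt_emb, db_embs[i]),
                   reverse=True)
    return order[:k]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def query_attend(queries: List[Vector], feats: List[Vector]) -> List[Vector]:
    """Each query attends over the retrieved feature tokens (keys = values = feats)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, f)) / scale for f in feats])
        # Weighted average of the feature tokens, one output token per query.
        out.append([sum(w * f[d] for w, f in zip(weights, feats))
                    for d in range(len(feats[0]))])
    return out
```

The output has one token per query regardless of how many clips are retrieved, which is one plausible reason to route retrieved features through a fixed set of queries before conditioning the diffusion blocks.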

If this is right

  • Video diffusion models can be made more physically consistent without retraining the entire network from scratch.
  • A modest curated database suffices to supply the missing physical priors that large-scale web video data alone do not provide.
  • Learnable queries offer a modular way to condition generation on retrieved physical examples across different motion categories.
  • Ablation results indicate that both the data-filtering stage and the retrieval injection step are necessary for the reported gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-plus-query pattern could be tested on domains that also suffer from sparse rule-following data, such as biological motion or fluid simulation.
  • If the filtered set proves biased toward certain object scales or lighting conditions, future work could add explicit diversity constraints during the second filtering stage.
  • The approach leaves open whether the retrieved clips are used only at inference or could also be mixed into continued training of the base diffusion model.

Load-bearing premise

The 7K filtered videos contain physical phenomena that are both accurate and sufficiently diverse that retrieval will correct rather than reinforce model errors.

What would settle it

A controlled test in which the retrieval step is disabled and physical-compliance scores on PhyGenBench fall to or below the scores of the strongest non-RAG baseline.

Figures

Figures reproduced from arXiv: 2606.26916 by Chunhe Song, Hao Tang, Kexu Cheng, Mingju Gao, Zicheng Liu.

Figure 1: Overview and Qualitative Results of PhysRAG.
Figure 2: Two-stage preprocessing for WISA-80K. We first use Qwen3-VL (caption-only) to select the top 10% candidates, then use Qwen3-VL (key frames + caption) to verify prompt–frame consistency and retain 90%, yielding a high-quality subset (∼7K videos).
Figure 3: Overview of our retrieval-augmented video diffusion framework. A Video Diffusion Transformer (DiT) generates videos from text and noise while retrieving relevant physical videos from a database. Retrieved features are encoded by VideoMAE and injected into DiT blocks through the Query Inject module.
Figure 4: Qualitative comparison of physical consistency across different scenarios from PhyGenBench [47]. PhysRAG outperforms Wan2.2-5B and PhyT2V by accurately modeling physical interactions. In the robotic arm etching scenario, PhysRAG captures the continuous motion required for the task, while baseline models fail to do so. In the fluid dripping example, PhysRAG correctly preserves fluid dynamics and material …
Figure 5: Ablation on PhyGenBench [47]. We compare PhysRAG (ours) with SFT and SFT+RAG on (a) drone-over-pool reflection and (b) oil pouring. In (a), SFT and SFT+RAG miss the water-surface reflection, while PhysRAG preserves it. In (b), SFT and SFT+RAG show non-causal liquid growth without coherent downward flow; PhysRAG produces temporally consistent pouring and accumulation.

Original abstract

Developing physically aware video generation models remains a significant challenge due to the difficulty in capturing diverse physical phenomena, such as thermal dynamics, mechanics, and optics. In this work, we introduce PhysRAG, a novel pipeline that enhances physical awareness in video generation through Retrieval-Augmented Generation (RAG). To address the issue of limited high-quality data, we design a two-stage data filtering pipeline based on the WISA-80K dataset, resulting in a curated set of 7K high-quality videos for training. Furthermore, we construct a physical video database and develop a mechanism to inject physical knowledge into a video diffusion model using learnable queries. Our method achieves state-of-the-art performance in both visual quality and physical rule compliance, surpassing existing models in benchmarks such as PhyGenBench and VBench. We conduct extensive ablation studies to validate the effectiveness of our key components, including the data filtering pipeline, RAG mechanism, and method for physical information extraction. To facilitate future research, our code, data, and models are prepared for release at https://github.com/sediment1024/PhysRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PhysRAG, a retrieval-augmented generation pipeline for enhancing physical awareness in video diffusion models. It describes a two-stage filtering process applied to the WISA-80K dataset to produce a curated 7K-video database of purported high physical quality, constructs a physical video database, and injects physical knowledge via learnable queries into the diffusion model. The work claims state-of-the-art performance on both visual quality and physical rule compliance on PhyGenBench and VBench, supported by ablation studies on the filtering pipeline, RAG mechanism, and physical information extraction; code, data, and models are slated for release.

Significance. If the empirical gains are shown to be robust under standard controls, the approach could offer a practical route to mitigating physical hallucinations in video generation by leveraging curated retrieval rather than purely parametric learning. The planned artifact release would aid reproducibility in a field where data curation for physics compliance is often opaque.

major comments (2)
  1. [two-stage data filtering pipeline] The two-stage data filtering pipeline (abstract and methods): the claim that the resulting 7K videos exhibit high physical fidelity sufficient for reliable hallucination correction is unsupported by any reported metrics such as physics-engine error, expert-rated compliance on conservation laws/optics, or inter-annotator agreement. This is load-bearing because the RAG mechanism's value rests on the database improving rather than reinforcing dataset biases.
  2. [benchmarks and results] Results on PhyGenBench and VBench (abstract): the SOTA claim is asserted without any quantitative tables, baseline comparisons, error bars, or effect sizes in the provided text, preventing assessment of whether reported gains survive standard controls for data leakage or evaluation variance.
minor comments (2)
  1. [abstract] The abstract states that ablation studies validate the key components but supplies no high-level quantitative outcomes or effect sizes, which would aid immediate assessment of component importance.
  2. [method for physical information extraction] The number and dimension of learnable queries are described as free parameters without discussion of sensitivity or default choices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate planned revisions to improve the rigor of the claims.

  1. Referee: [two-stage data filtering pipeline] The two-stage data filtering pipeline (abstract and methods): the claim that the resulting 7K videos exhibit high physical fidelity sufficient for reliable hallucination correction is unsupported by any reported metrics such as physics-engine error, expert-rated compliance on conservation laws/optics, or inter-annotator agreement. This is load-bearing because the RAG mechanism's value rests on the database improving rather than reinforcing dataset biases.

    Authors: We acknowledge that the manuscript does not report direct quantitative metrics (e.g., physics-engine error or expert-rated compliance scores with inter-annotator agreement) on the physical fidelity of the curated 7K videos. The two-stage pipeline applies automated heuristics based on WISA-80K annotations for visual quality and basic physical plausibility, and its value is supported indirectly via ablation studies showing performance gains on PhyGenBench when the filtered database is used. To strengthen the claim, we will add a new subsection with expert evaluation results on a random subset of the 7K videos, including compliance ratings for conservation laws and optics along with inter-annotator agreement statistics. revision: yes

  2. Referee: [benchmarks and results] Results on PhyGenBench and VBench (abstract): the SOTA claim is asserted without any quantitative tables, baseline comparisons, error bars, or effect sizes in the provided text, preventing assessment of whether reported gains survive standard controls for data leakage or evaluation variance.

    Authors: The full manuscript (Section 4 and associated tables) contains quantitative comparisons against baselines on both PhyGenBench and VBench, reporting metrics for visual quality and physical compliance. However, the current version lacks error bars from repeated runs and formal analysis of effect sizes or data leakage controls. We will revise the experimental section to include these elements, along with explicit discussion of steps taken to avoid overlap between the physical video database and benchmark evaluation prompts. revision: yes
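
The statistics promised in the two responses above (inter-annotator agreement, error bars from repeated runs, and effect sizes) could be computed along the following lines. This is a minimal sketch under assumptions: it uses Cohen's kappa, which covers exactly two raters, and Cohen's d with a pooled standard deviation, neither of which the authors have committed to, and all function names are hypothetical. No scores from the paper are used.

```python
import math
import statistics
from collections import Counter
from typing import Hashable, Sequence, Tuple


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs (needs at least 2 runs)."""
    return statistics.mean(scores), statistics.stdev(scores)


def cohens_d(treated: Sequence[float], baseline: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treated), len(baseline)
    v1, v2 = statistics.variance(treated), statistics.variance(baseline)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    if pooled == 0.0:
        raise ValueError("zero pooled variance: effect size is undefined")
    return (statistics.mean(treated) - statistics.mean(baseline)) / pooled
```

With more than two expert raters, a multi-rater statistic such as Fleiss' kappa would be the natural substitute for the first function.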

Circularity Check

0 steps flagged

No circularity; empirical pipeline with external benchmarks only

The paper presents an empirical pipeline: two-stage filtering of WISA-80K to 7K videos, construction of a physical video database, and injection of physical knowledge via learnable queries into a diffusion model. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. SOTA claims rest on external benchmarks (PhyGenBench, VBench) rather than any quantity defined inside the paper. The absence of mathematical structure means none of the six enumerated circularity patterns can be exhibited by quote and reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the untested premise that the curated 7K videos plus retrieval will supply sufficient physical constraints. The diffusion backbone itself is treated as a black-box prior. No new physical laws or particles are postulated.

free parameters (1)
  • number and dimension of learnable queries
    Design choice whose count and size are not derived from first principles and must be selected to make retrieval effective.
axioms (1)
  • domain assumption: Existing video diffusion models can be conditioned on external retrieved features without destroying their generative capability.
    Invoked when the paper states that physical knowledge is injected via learnable queries.

pith-pipeline@v0.9.1-grok · 5738 in / 1417 out tokens · 20910 ms · 2026-06-26T05:14:50.972576+00:00 · methodology

Reference graph

Works this paper leans on

93 extracted references · 1 canonical work page

  1. [1]

    In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Bińkowski, M.a., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Fla...

  2. [2]

    arXiv preprint arXiv:2511.21631 (2025)

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  3. [3]

    arXiv preprint arXiv:2503.06800 (2025)

    Bansal, H., Peng, C., Bitton, Y., Goldenberg, R., Grover, A., Chang, K.W.: Videophy-2: A challenging action-centric physical commonsense evaluation in video generation. arXiv preprint arXiv:2503.06800 (2025)

  4. [4]

    In: SIGGRAPH Asia 2024 Conference Papers

    Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

  5. [5]

    Advances in neural information processing systems 29 (2016)

    Battaglia, P., Pascanu, R., Lai, M., Jimenez Rezende, D., et al.: Interaction networks for learning about objects, relations and physics. Advances in neural information processing systems 29 (2016)

  6. [6]

    arXiv preprint arXiv:2106.08261 (2021)

    Bear, D.M., Wang, E., Mrowca, D., Binder, F.J., Tung, H.Y.F., Pramod, R., Holdaway, C., Tao, S., Smith, K., Sun, F.Y., et al.: Physion: Evaluating physical prediction from vision in humans and machines. arXiv preprint arXiv:2106.08261 (2021)

  7. [7]

    arXiv preprint arXiv:2410.24164 (2024)

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  8. [8]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)

  9. [9]

    Advances in Neural Information Processing Systems 35, 15309–15324 (2022)

    Blattmann, A., Rombach, R., Oktay, K., Müller, J., Ommer, B.: Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems 35, 15309–15324 (2022)

  10. [10]

    OpenAI Blog 1(8), 1 (2024)

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog 1(8), 1 (2024)

  11. [11]

    arXiv preprint arXiv:2512.24551 (2025)

    Cai, Y., Li, K., Jia, M., Wang, J., Sun, J., Liang, F., Chen, W., Juefei-Xu, F., Wang, C., Thabet, A., et al.: Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation. arXiv preprint arXiv:2512.24551 (2025)

  12. [12]

    arXiv preprint arXiv:1612.00341 (2016)

    Chang, M.B., Ullman, T., Torralba, A., Tenenbaum, J.B.: A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341 (2016)

  13. [13]

    arXiv preprint arXiv:2310.19512 (2023)

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)

  14. [14]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y.: Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7310–7320 (2024)

  15. [15]

    arXiv preprint arXiv:2508.10858 (2025)

    Chen, H.H., Huang, H., Chen, Q., Yang, H., Lim, S.N.: Hierarchical fine-grained preference optimization for physically plausible video generation. arXiv preprint arXiv:2508.10858 (2025)

  16. [16]

    In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

    Chen, W., Hu, H., Chen, X., Verga, P., Cohen, W.: Murag: Multimodal retrieval-augmented generator for open question answering over images and text. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 5558–5570 (2022)

  17. [17]

    arXiv preprint arXiv:2209.14491 (2022)

    Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491 (2022)

  18. [18]

    Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., Jégou, H.: The faiss library (2025), https://arxiv.org/abs/2401.08281

  19. [19]

    arXiv preprint arXiv:2501.08453 (2025)

    Fan, W., Si, C., Song, J., Yang, Z., He, Y., Zhuo, L., Huang, Z., Dong, Z., He, J., Pan, D., et al.: Vchitect-2.0: Parallel transformer for scaling up video diffusion models. arXiv preprint arXiv:2501.08453 (2025)

  20. [20]

    arXiv preprint arXiv:2506.09113 (2025)

    Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025)

  21. [21]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3749–3761 (2022)

  22. [22]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Guen, V.L., Thome, N.: Disentangling physical dynamics from unknown factors for unsupervised video prediction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11474–11484 (2020)

  23. [23]

    Hao, Y., Chen, C., Mian, A.S., Xu, C., Liu, D.: Enhancing physical plausibility in video generation by reasoning the implausibility (2025), https://arxiv.org/abs/2509.24702

  24. [24]

    arXiv preprint arXiv:2210.02303 (2022)

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)

  25. [25]

    Advances in neural information processing systems 35, 8633–8646 (2022)

    Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems 35, 8633–8646 (2022)

  26. [26]

    arXiv preprint arXiv:2205.15868 (2022)

    Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

  27. [27]

    Huang, Y., Lv, X., Xu, J., Yu, Z., Zhang, J., Hu, R., Feng, W., Zou, S., Xiao, H., Zhou, Z., Huang, K., Peng, Z., Xu, J., Zhao, H., Zhu, C., Yi, R., Huang, Y., Wu, D., Zhang, Y., Cheng, K., Song, C., Xue, Y., Zhang, X., Guo, L., Chen, Y., Wu, B., Yu, H., Xu, K.: Paiworld: A 3d-consistent world foundation model for robotic manipulation (2026),https://arxiv...

  28. [28]

    arXiv preprint arXiv:2505.11528 (2025)

    Huang, Y., Zhang, J., Zou, S., Liu, X., Hu, R., Xu, K.: Ladi-wm: A latent diffusion-based world model for predictive manipulation. arXiv preprint arXiv:2505.11528 (2025)

  29. [29]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  30. [30]

    arXiv preprint arXiv:2410.21276 (2024)

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  31. [31]

    Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., Driess, D., Equi, M., Esmail, A., Fang, Y., Finn, C., Glossop, C., Godden, T., Goryachev, I., Groom, L., Hancock, H., Hausman, K., Hussein, G., Ichter, B., Jakubczak, S., Jen, R., Jones, T., Katz, B., Ke, L., Kuchi, C.,...

  32. [32]

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M.Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Ren, A.Z., Shi, L.X., Smith, L., Springenberg, J.T., Stachow...

  33. [33]

    arXiv preprint arXiv:2510.13809 (2025)

    Ji, S., Chen, X., Tao, X., Wan, P., Zhao, H.: Physmaster: Mastering physical representation for video generation via reinforcement learning. arXiv preprint arXiv:2510.13809 (2025)

  34. [34]

    arXiv preprint arXiv:2411.02385 (2024)

    Kang, B., Yue, Y., Lu, R., Lin, Z., Zhao, Y., Wang, K., Huang, G., Feng, J.: How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385 (2024)

  35. [35]

    arXiv preprint arXiv:2412.03603 (2024)

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  36. [36]

    Kuaishou Technology: Kling ai: High-fidelity text-to-video generation. https://www.kuaishou.com/kling (2024)

  37. [37]

    Advances in neural information processing systems 33, 9459–9474 (2020)

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33, 9459–9474 (2020)

  38. [38]

    arXiv preprint arXiv:2503.09595 (2025)

    Li, C., Michel, O., Pan, X., Liu, S., Roberts, M., Xie, S.: Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop. arXiv preprint arXiv:2503.09595 (2025)

  39. [39]

    In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

    Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp....

  40. [40]

    IEEE Robotics and Automation Letters 9(6), 6012–6019 (2024)

    Liu, C., Shi, K., Zhou, K., Wang, H., Zhang, J., Dong, H.: Rgbgrasp: Image-based object grasping by capturing multiple views during robot arm movement with neural radiance fields. IEEE Robotics and Automation Letters 9(6), 6012–6019 (2024)

  41. [41]

    arXiv preprint arXiv:2501.10928 (2025)

    Liu, D., Zhang, J., Dinh, A.D., Park, E., Zhang, S., Mian, A., Shah, M., Xu, C.: Generative physical ai in vision: A survey. arXiv preprint arXiv:2501.10928 (2025)

  42. [42]

    arXiv preprint arXiv:2501.13918 (2025)

    Liu, J., Liu, G., Liang, J., Yuan, Z., Liu, X., Zheng, M., Wu, X., Wang, Q., Xia, M., Wang, X., et al.: Improving video generation with human feedback. arXiv preprint arXiv:2501.13918 (2025)

  43. [43]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Liu, R., Wu, H., Zheng, Z., Wei, C., He, Y., Pi, R., Chen, Q.: Videodpo: Omni-preference alignment for video diffusion generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8009–8019 (2025)

  44. [44]

    arXiv preprint arXiv:2402.17177 (2024)

    Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., et al.: Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177 (2024)

  45. [45]

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019), https://arxiv.org/abs/1711.05101

  46. [46]

    Ma, G., Huang, H., Yan, K., Chen, L., Duan, N., Yin, S., Wan, C., Ming, R., Song, X., Chen, X., Zhou, Y., Sun, D., Zhou, D., Zhou, J., Tan, K., An, K., Chen, M., Ji, W., Wu, Q., Sun, W., Han, X., Wei, Y., Ge, Z., Li, A., Wang, B., Huang, B., Wang, B., Li, B., Miao, C., Xu, C., Wu, C., Yu, C., Shi, D., Hu, D., Liu, E., Yu, G., Yang, G., Huang, G., Yan, G.,...

  47. [47]

    arXiv preprint arXiv:2410.05363 (2024)

    Meng, F., Liao, J., Tan, X., Shao, W., Lu, Q., Zhang, K., Cheng, Y., Li, D., Qiao, Y., Luo, P.: Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363 (2024)

  48. [48]

    arXiv preprint arXiv:2502.07007 (2025)

    Meng, S., Luo, Y., Liu, P.: Grounding creativity in physics: A brief survey of physical priors in aigc. arXiv preprint arXiv:2502.07007 (2025)

  49. [49]

    In: Proceedings of the AAAI conference on artificial intelligence

    Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI conference on artificial intelligence. vol. 38, pp. 4296–4304 (2024)

  50. [50]

    In: Proceedings of the 2025 International Conference on Multimedia Retrieval

    Peruzzo, E., Xu, D., Xu, X., Shi, H., Sebe, N.: Ragme: Retrieval augmented video generation for enhanced motion realism. In: Proceedings of the 2025 International Conference on Multimedia Retrieval. pp. 1081–1090. ICMR ’25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3731715.3733417

  51. [51]

    Pika Labs: Pika: Text-to-video generation platform. https://pika.art (2024)

  52. [52]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  53. [53]

    arXiv preprint arXiv:2510.02284 (2025)

    Romero, D., Bermudez, A., Li, H., Pizzati, F., Laptev, I.: Learning to generate rigid body interactions with video diffusion models. arXiv preprint arXiv:2510.02284 (2025)

  54. [54]

    Runway ML: Runway gen-3 alpha. https://research.runwayml.com/gen3 (2024)

  55. [55]

    arXiv preprint arXiv:2209.14792 (2022)

    Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)

  56. [56]

    Song, S., Xu, Z., Zhang, Z., Zhou, K., Guo, J., Qin, L., Huang, B.: Learning plug-and-play memory for guiding video diffusion models (2025), https://arxiv.org/abs/2511.19229

  57. [57]

    arXiv preprint arXiv:2605.23345 (2026)

    Tong, Z., Lai, H., Wang, Z., Xing, Z., Cheng, K., Xu, H., Pu, Z., Zhu, S., Feng, R., Zhao, J., et al.: Scope: Simulating cross-game operations in playable environments for fps world models. arXiv preprint arXiv:2605.23345 (2026)

  58. [58]

    arXiv preprint arXiv:1909.13789 (2019)

    Toth, P., Rezende, D.J., Jaegle, A., Racanière, S., Botev, A., Higgins, I.: Hamiltonian generative networks. arXiv preprint arXiv:1909.13789 (2019)

  59. [59]

    Advances in Neural Information Processing Systems 36, 67048–67068 (2023)

    Tung, H.Y., Ding, M., Chen, Z., Bear, D., Gan, C., Tenenbaum, J., Yamins, D., Fan, J., Smith, K.: Physion++: Evaluating physical scene understanding that requires online inference of different physical properties. Advances in Neural Information Processing Systems 36, 67048–67068 (2023)

  60. [60]

    arXiv preprint arXiv:2503.20314 (2025)

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  61. [61]

    arXiv preprint arXiv:2509.20358 (2025)

    Wang, C., Chen, C., Huang, Y., Dou, Z., Liu, Y., Gu, J., Liu, L.: Physctrl: Generative physics for controllable and physics-grounded video generation. arXiv preprint arXiv:2509.20358 (2025)

  62. [62]

    arXiv preprint arXiv:2604.24575 (2026)

    Wang, H., Xiang, A., Sun, H., Sun, P., Pan, C., Chen, Y., Hong, M., Wang, W., Chen, S., Chen, Y., et al.: Diffusion model as a generalist segmentation learner. arXiv preprint arXiv:2604.24575 (2026)

  63. [63]

    In: 2025 IEEE International Conference on Robotics and Automation (ICRA)

    Wang, H., Zhou, K., Gu, B., Feng, Z., Wang, W., Sun, P., Xiao, Y., Zhang, J., Dong, H.: Transdiff: Diffusion-based method for manipulating transparent objects using a single rgb-d image. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 7277–7283. IEEE (2025)

  64. [64]

    In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

    Wang, J., Wang, C., Huang, K., Huang, J., Jin, L.: Videoclip-xl: Advancing long description understanding for video clip models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 16061–16075 (2024)

  65. [65]

    Wang, J., Ma, A., Cao, K., Zheng, J., Zhang, Z., Feng, J., Liu, S., Ma, Y., Cheng, B., Leng, D., et al.: Wisa: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153 (2025)

  66. [66]

    Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)

  67. [67]

    Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: Videomae v2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14549–14560 (June 2023)

  68. [68]

    Wang, P., Wang, W., Li, Q.: Physcorr: Dual-reward dpo for physics-constrained text-to-video generation with automated preference selection. arXiv preprint arXiv:2511.03997 (2025)

  69. [69]

    Wang, W., Chen, Y., Zhang, Z., Liu, H., Wang, H., Feng, Z., Qin, W., Chen, F., Zhu, Z., Chen, D.Y., et al.: Volsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned prediction. arXiv preprint arXiv:2509.19297 (2025)

  70. [70]

    Wang, W., He, X., Gu, Y., Yang, Y., Zhang, Z., He, Y., Ding, Y., Hu, X., Chen, D.Y., He, Z., et al.: World-r1: Reinforcing 3d constraints for text-to-video generation. arXiv preprint arXiv:2604.24764 (2026)

  71. [71]

    Wang, X., Zhu, Z., Huang, G., Chen, X., Zhu, J., Lu, J.: Drivedreamer: Towards real-world-drive world models for autonomous driving. In: European conference on computer vision. pp. 55–72. Springer (2024)

  72. [72]

    Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision 133(5), 3059–3078 (2025)

  73. [73]

    Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., Tacchetti, A.: Visual interaction networks: Learning a physics simulator from video. Advances in neural information processing systems 30 (2017)

  74. [74]

    Wen, Y., Zhao, Y., Liu, Y., Jia, F., Wang, Y., Luo, C., Zhang, C., Wang, T., Sun, X., Zhang, X.: Panacea: Panoramic and controllable video generation for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6902–6912 (2024)

  75. [75]

    Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)

  76. [76]

    Wu, J.Z., Ge, Y., Wang, X., Lei, S.W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z.: Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7623–7633 (October 2023)

  77. [77]

    Xiao, J., Liang, D., Zhou, X., Ye, Y., Su, T., Yi, G., Xia, B., Lyu, Q., Shi, S., Huang, J., et al.: Divide and conquer: Decoupled representation alignment for multimodal world models. arXiv preprint arXiv:2605.01896 (2026)

  78. [78]

    Xue, Q., Yin, X., Yang, B., Gao, W.: Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18826–18836 (2025)

  79. [79]

    Yang, X., Li, B., Zhang, Y., Yin, Z., Bai, L., Ma, L., Wang, Z., Cai, J., Wong, T.T., Lu, H., et al.: Vlipp: Towards physically plausible video generation with vision and language informed physical prior. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12360–12370 (2025)

  80. [80]

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
