pith. sign in

arxiv: 2509.24702 · v2 · submitted 2025-09-29 · 💻 cs.CV

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

Pith reviewed 2026-05-18 12:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords physical plausibilityvideo generationdiffusion modelscounterfactual promptssynchronized decoupled guidancetraining-freephysics-aware reasoningimplausibility
0
0 comments X p. Extension

The pith

A training-free pipeline reasons about physics violations to guide diffusion models away from implausible video motions at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that diffusion models for video can be made more physically plausible without any retraining by explicitly identifying implausibilities and steering generation away from them during the denoising process. It builds a lightweight reasoning step that produces counterfactual prompts describing physics-violating actions, then applies a new synchronized decoupled guidance mechanism to suppress those violations consistently across the generation trajectory. A reader would care because this sidesteps the expense and incompleteness of forcing models to absorb physical rules only through massive text-video datasets, offering a plug-in way to improve fidelity while keeping visual quality intact.

Core claim

We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Abl

What carries the argument

Synchronized Decoupled Guidance (SDG), which combines synchronized directional normalization to prevent lagged suppression with trajectory-decoupled denoising to avoid cumulative bias when steering away from physics-violating counterfactual prompts.

If this is right

  • Physical fidelity improves substantially across domains while photorealism is preserved without extra training.
  • The physics-aware reasoning component and the full SDG strategy prove complementary through ablation.
  • Each element of SDG individually aids suppression of implausible content and overall plausibility gains.
  • The method supplies a plug-and-play physics-aware paradigm usable on existing diffusion video generators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar counterfactual reasoning could be tested on image or 3D generation tasks where logical consistency matters.
  • Wider use of inference-time guidance might lessen the data volume needed to train future generative models for physical domains.
  • The approach could be paired with existing video editing tools to correct specific implausible segments after initial generation.

Load-bearing premise

The lightweight physics-aware reasoning pipeline can reliably build counterfactual prompts that capture genuine physics violations, and the synchronized decoupled guidance can suppress those violations without introducing new artifacts or lowering photorealism.

What would settle it

A side-by-side evaluation on a fixed set of text prompts measuring the rate of physics violations such as incorrect gravity or interpenetration in generated videos, alongside unchanged or improved scores on standard visual quality metrics like FID or user preference for realism.

Figures

Figures reproduced from arXiv: 2509.24702 by Ajmal Saeed Mian, Chang Xu, Chen Chen, Daochang Liu, Yutong Hao.

Figure 1
Figure 1. Figure 1: Overall framework. Left: Physics-Aware Reasoning (PAR). Given a user prompt, an LLM identifies entities, interactions, and scene conditions to produce a structured analysis of the underlying physical process. Based on this reasoning, it constructs counterfactual prompts that preserve the same entities and scenes but deliberately violate the governing physical law, yielding targeted physics-aware negatives.… view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative comparison with Wan2.1. Prompt: “A vibrant, elastic tennis ball is thrown forcefully towards the ground, capturing its dynamic interaction with the surface upon impact.” Baseline: The tennis ball’s motion is inconsistent with gravity-driven dynamics, with limited deformation on impact and abrupt transitions across frames. The bounce lacks elasticity. Ours: Our result shows a more natural downwa… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison with CogvideoX. Prompt: “A yellow highlighter is used to mark on the rough, brown surface of a cardboard, showcasing the interaction between the highlighter and the cardboard surface.” Baseline: Generates inconsistent strokes, with the yellow mark appearing flat and disconnected from the cardboard’s texture. The contact point with the marker is visually unconvincing. Ours: Produces a… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison with Wan2.1. Prompt: “A silver spoon is slowly inserted into a glass of crystal-clear water, revealing the fascinating visual changes and reflections as the spoon interacts with the liquid.” Baseline: The generated sequence struggles to capture realistic refraction and liquid interaction. The spoon appears disconnected from the water surface, and the reflections lack physical plausib… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison with Wan2.1. Prompt: “A timelapse captures the transformation of water in a pot as the temperature rapidly rises above 100°C.” Baseline: The sequence unrealistically depicts explosive splashes, ignoring the gradual bubbling and vapor release expected from water heating above 100°C. Ours: Our method captures progressive bubbling and the formation of rising vapor clouds, consistent wit… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison with Wan2.1. Prompt: “A small burning stick was thrown into a pile of hay.” Baseline: The ignition of hay is abrupt and spatially inconsistent, with flames appearing unnaturally large and sudden. Ours: Our model shows fire propagating gradually from the burning stick to the hay, with smoother flame development and more realistic local ignition dynamics. CogvideoX Ours [PITH_FULL_IMA… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison with CogvideoX. Prompt: “A concentrated, bright beam of light generated by a laser pointer is passing through a glass of thick whole milk, creating a mesmerizing display as the light interacts with the milk’s particles, casting intricate patterns and subtle hues within the fluid.” Baseline: The light beam appears static and detached from the milk medium, with minimal scattering or hu… view at source ↗
Figure 8
Figure 8. Figure 8: Prompt: “Qualitative comparison with CogvideoX. A timelapse captures the gradual transformation of butter as the temperature rises significantly.” Baseline: The butter remains largely unchanged, with rigid textures and little indication of gradual phase transition. The thermal process is not conveyed. Ours: Our method depicts butter softening and progressively melting, accompanied by rising vapor. This bet… view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative comparison with CogvideoX. Prompt: “Equal amounts of red and blue paint are rapidly combined, with the mixture being vigorously stirred until fully blended.” Baseline: The mixing of red and blue paint is incomplete and static, with colors remaining largely separated. The blending dynamics are underdeveloped. Ours: Our sequence shows vigorous stirring, with swirling patterns and gradual blending… view at source ↗
Figure 10
Figure 10. Figure 10: Additional qualitative samples generated by our model across diverse prompts. Alongside the visual results, [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Additional comparisons with both the baseline and using negative prompting (NP) in CFG. The symbols (-) [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Instruction template used for physics-aware reasoning. The LLM is prompted to generate counterfactual [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Examples of physics-aware reasoning for counterfactual prompt construction across different physical [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Qualitative ablation of physics-aware reasoning for counterfactual prompt construction. We show two [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
read the original abstract

Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the aforementioned two designs of SDG are also individually validated to contribute critically to the suppression of implausible content and the overall gains in physical plausibility. This establishes a new and plug-and-play physics-aware paradigm for video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a training-free framework for improving physical plausibility in diffusion-based video generation. It employs a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that encode physics-violating behaviors, followed by a novel Synchronized Decoupled Guidance (SDG) strategy that uses synchronized directional normalization to address lagged suppression and trajectory-decoupled denoising to reduce cumulative trajectory bias. Experiments across physical domains and ablation studies are said to demonstrate substantial gains in physical fidelity while preserving photorealism.

Significance. If the central claims hold, the work offers a practical inference-time alternative to costly retraining for addressing physical implausibilities in video models, which could be broadly applicable as a plug-and-play module. The explicit focus on reasoning about implausibility rather than implicit dataset learning represents a potentially useful paradigm shift, provided the pipeline and guidance mechanisms deliver the promised targeted suppression without side effects.

major comments (2)
  1. [Method] The central claim depends on the physics-aware reasoning pipeline reliably producing counterfactual prompts that encode genuine physics violations rather than superficial changes. However, the manuscript provides insufficient detail on the pipeline's identification and encoding process (e.g., in the method description), leaving open whether observed gains arise from true physical reasoning or generic prompt guidance.
  2. [Experiments and Ablations] Ablation studies are cited as confirming the effectiveness of the physics-aware component and the two SDG designs, but they do not include controls isolating physics-specific reasoning from non-physics prompt modifications. This is load-bearing for validating that improvements in physical fidelity stem from the proposed mechanisms rather than broader guidance effects.
minor comments (1)
  1. [Abstract] The abstract states that experiments show 'substantial' enhancements but does not reference specific quantitative metrics, datasets, or error bars; adding these references would improve clarity even if full details appear later.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the paper to strengthen the presentation of the method and experiments.

read point-by-point responses
  1. Referee: [Method] The central claim depends on the physics-aware reasoning pipeline reliably producing counterfactual prompts that encode genuine physics violations rather than superficial changes. However, the manuscript provides insufficient detail on the pipeline's identification and encoding process (e.g., in the method description), leaving open whether observed gains arise from true physical reasoning or generic prompt guidance.

    Authors: We agree that additional detail on the physics-aware reasoning pipeline would improve clarity. In the revised manuscript, we have expanded Section 3.1 with a step-by-step description of the identification process, which queries a language model against a fixed set of physical principles (e.g., conservation of momentum, gravity) to detect violations in the initial prompt. We also provide concrete examples of how violations are encoded into counterfactual prompts, such as inverting gravitational direction or violating object rigidity. A new algorithmic pseudocode box and illustrative figure have been added to demonstrate that the pipeline targets genuine physics violations rather than generic modifications. These changes make explicit that the observed gains derive from physics-specific reasoning. revision: yes

  2. Referee: [Experiments and Ablations] Ablation studies are cited as confirming the effectiveness of the physics-aware component and the two SDG designs, but they do not include controls isolating physics-specific reasoning from non-physics prompt modifications. This is load-bearing for validating that improvements in physical fidelity stem from the proposed mechanisms rather than broader guidance effects.

    Authors: We acknowledge the value of explicit isolation controls. In the revised manuscript, we have added a new ablation experiment (Table 4) that replaces the physics-aware counterfactual prompts with non-physics prompt modifications of comparable complexity (e.g., stylistic alterations or addition of unrelated descriptors). Quantitative results on physical fidelity metrics (e.g., violation detection scores) show that only the physics-specific prompts produce substantial gains, while generic modifications yield minimal or no improvement. These controls confirm that the benefits arise from the targeted physics reasoning rather than general guidance effects. The updated ablation section now includes this comparison alongside the original studies. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new inference-time method with independent experimental validation

full rationale

The paper introduces a training-free framework consisting of a lightweight physics-aware reasoning pipeline for counterfactual prompts and a Synchronized Decoupled Guidance (SDG) strategy with directional normalization and trajectory-decoupled denoising. These are presented as novel procedural components applied at inference time rather than derived from equations or parameters that reduce to the inputs by construction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described approach. Ablation studies are cited as independent confirmation of component contributions, and the central claims rest on empirical results across physical domains rather than tautological reductions. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on the abstract alone, the method rests on the domain assumption that physical laws can be captured via text prompts and that directional guidance during denoising can be decoupled without side effects. No free parameters or invented physical entities are mentioned.

axioms (2)
  • domain assumption Physical laws can be reliably encoded in counterfactual text prompts by a lightweight reasoning pipeline.
    Invoked in the description of the physics-aware reasoning pipeline that constructs prompts encoding physics-violating behaviors.
  • domain assumption Synchronized directional normalization and trajectory-decoupled denoising can suppress implausible content consistently across the denoising process.
    Central to the proposed SDG strategy as described in the abstract.

pith-pipeline@v0.9.0 · 5753 in / 1553 out tokens · 40459 ms · 2026-05-18T12:51:59.143943+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 11 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, et al. Cosmos world: Foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

  2. [2]

    Motioncraft: Physics-based zero-shot video generation.arXiv preprint arXiv:2405.13557,

    Luca Savant Aira, Antonio Montanaro, Emanuele Aiello, Diego Valsesia, and Enrico Magli. Motioncraft: Physics-based zero-shot video generation.arXiv preprint arXiv:2405.13557,

  3. [3]

    Perp-neg: Re-imagine the negative prompt algorithm.arXiv preprint arXiv:2304.04968,

    Mohammadreza Armandpour, Ali Sadeghian, Huangjie Zheng, Amir Sadeghian, and Mingyuan Zhou. Perp-neg: Re-imagine the negative prompt algorithm.arXiv preprint arXiv:2304.04968,

  4. [4]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800,

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grovera, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800,

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127,

  6. [6]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Chen Chen, Daochang Liu, Mubarak Shah, and Chang Xu. Exploring local memorization in diffusion models via bright ending attention.ICLR, 2025a. Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models.arXiv preprint arXiv:2401.09047,

  7. [7]

    Hierarchical fine-grained preference optimization for physically plausible video generation.arXiv preprint arXiv:2508.10858, 2025b

    Harold Haodong Chen, Haojian Huang, Qifeng Chen, Harry Yang, and Ser-Nam Lim. Hierarchical fine-grained preference optimization for physically plausible video generation.arXiv preprint arXiv:2508.10858, 2025b. 9 APREPRINT- Yunuo Chen, Junli Cao, Anil Kag, Vidit Goel, Sergei Korolev, Chenfanfu Jiang, Sergey Tulyakov, and Jian Ren. Towards physical understa...

  8. [8]

    Structure and content-guided video synthesis with diffusion models.arXiv preprint arXiv:2302.03011,

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models.arXiv preprint arXiv:2302.03011,

  9. [9]

    Vchitect-2.0: Parallel transformer for scaling up video diffusion models.arXiv preprint arXiv:2501.08453,

    Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, Yi Wang, Yuming Jiang, Yaohui Wang, Peng Gao, Xinyuan Chen, Hengjie Li, Dahua Lin, Yu Qiao, Ziwei Liu, et al. Vchitect-2.0: Parallel transformer for scaling up video diffusion models.arXiv preprint arXiv:2501.08453,

  10. [10]

    Emu video: Factorizing text-to-video generation by explicit image conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709,

  11. [11]

    Classifier-Free Diffusion Guidance

    Accessed: 2025-09-21. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv:2207.12598,

  12. [12]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022a. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and ...

  13. [13]

    Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, and Saining Xie

    Accessed: 2025-09-21. Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, and Saining Xie. Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop.arXiv preprint arXiv:2503.09595,

  14. [14]

    Reasoning physical video generation with diffusion timestep tokens via reinforcement learning.arXiv preprint arXiv:2504.15932,

    Wang Lin, Liyu Jia, Wentao Hu, Kaihang Pan, Zhongqi Yue, Wei Zhao, Jingyuan Chen, Fei Wu, and Hanwang Zhang. Reasoning physical video generation with diffusion timestep tokens via reinforcement learning.arXiv preprint arXiv:2504.15932,

  15. [15]

    Generative physical ai in vision: A survey.arXiv preprint arXiv:2501.10928,

    Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, and Chang Xu. Generative physical ai in vision: A survey.arXiv preprint arXiv:2501.10928,

  16. [16]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma, Haoyang Huang, Kun Yan, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model.arXiv preprint arXiv:2502.10248,

  17. [17]

    Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363,

  18. [18]

    Do generative video models learn physical principles from watching videos? arXiv preprint arXiv:2501.09038, 2025

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038,

  19. [19]

    Scalable Diffusion Models with Transformers

    URL https://openai.com/index/ video-generation-models-as-world-simulators/. William Peebles and Saining Xie. Scalable diffusion models with transformers.arXiv preprint arXiv:2212.09748,

  20. [20]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    URLhttps://runwayml.com/research/introducing-gen-3-alpha. Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792,

  21. [21]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

  22. [22]

    Wisa: World simulator assistant for physics-aware text-to-video generation

    Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, and Xiaodan Liang. Wisa: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153,

  23. [23]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023a. Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen C...

  24. [24]

    Stable diffusion 2.0 and the importance of negative prompts for good results

    Max Woolf. Stable diffusion 2.0 and the importance of negative prompts for good results. https://minimaxir.com/ 2022/11/stable-diffusion-negative-prompt/,

  25. [25]

    Vlipp: Towards physically plausible video generation with vision and language informed physical prior.arXiv preprint arXiv:2503.23368, 2025a

    Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, and Xu Jia. Vlipp: Towards physically plausible video generation with vision and language informed physical prior.arXiv preprint arXiv:2503.23368, 2025a. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yua...

  26. [26]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404,

  27. [27]

    A silver spoon is slowly inserted into a glass of crystal-clear water, revealing the fascinating visual changes and reflections as the spoon interacts with the liquid

    11 APREPRINT- 6 Appendix Outline.This appendix provides additional results, implementation details, ablation analyses, a literature review, and a declaration on our LLM usage to further support the main paper. It is organized as follows: • Sec. 6.1presentsadditional qualitative comparisonswith CogVideoX-5B and Wan2.1-14B across mechanics, thermodynamics, ...