pith. machine review for the scientific record.

arxiv: 2604.02817 · v1 · submitted 2026-04-03 · 💻 cs.CV

MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

Gang Yu, Jin Gao, Shubo Lin, Wei Cheng, Weiming Hu, Xuanyang Zhang

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · physical plausibility · multimodal modeling · diffusion models · pseudo-RGB · teacher-student distillation · data curation pipeline · spatio-temporal trajectories

The pith

Recasting semantics, geometry, and trajectories into pseudo-RGB lets video diffusion models capture physical dynamics without added inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework that converts perceptual signals such as object semantics, scene geometry, and motion trajectories into a single pseudo-RGB image format so that standard video diffusion models can learn physical rules directly from their training objective. A Bidirectionally Controlled Teacher architecture keeps RGB and perception streams separate during training and gradually aligns them with zero-initialized links, after which the physical prior is distilled into a single-stream student model. A separate data pipeline uses vision-language models to annotate physics-rich videos at multiple granularities. If the approach works, generated videos should respect gravity, collisions, and object permanence more reliably than pixel-only diffusion while running at the same speed as existing models.
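The review does not say how each perception signal is rasterized, so the following is a minimal sketch of what a pseudo-RGB encoding could look like, assuming depth maps, instance masks, and 2D trajectories are each rendered into ordinary three-channel frames that the same latent encoder and denoising backbone can consume. Function names, colormaps, and channel conventions are illustrative, not the authors' code.

```python
import numpy as np
import cv2  # assumed available for colormaps and drawing


def depth_to_pseudo_rgb(depth: np.ndarray) -> np.ndarray:
    """Normalize an (H, W) depth map and render it as a 3-channel uint8 image."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return cv2.applyColorMap((d * 255).astype(np.uint8), cv2.COLORMAP_TURBO)


def semantics_to_pseudo_rgb(instance_ids: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Map an (H, W) instance-id map to a fixed color palette so identity is stable over time."""
    return palette[instance_ids % len(palette)]  # (H, W, 3) uint8


def trajectory_to_pseudo_rgb(points: np.ndarray, shape: tuple) -> np.ndarray:
    """Draw a (T, 2) pixel trajectory as a brightness-coded polyline on a black canvas."""
    canvas = np.zeros((*shape, 3), dtype=np.uint8)
    for i in range(1, len(points)):
        level = int(255 * i / max(len(points) - 1, 1))  # later segments are brighter
        cv2.line(canvas,
                 tuple(map(int, points[i - 1])),
                 tuple(map(int, points[i])),
                 color=(level, level, 255), thickness=2)
    return canvas
```

Because every signal then lives in the same H x W x 3 space as an RGB frame, the perception branch can reuse the identical encoder and training objective, which is presumably what keeps the added modalities free at inference time once the student is distilled.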

Core claim

MMPhysVideo recasts semantics, geometry, and spatio-temporal trajectories into a unified pseudo-RGB format so video diffusion models directly capture complex physical dynamics; a Bidirectionally Controlled Teacher with parallel branches and zero-initialized control links decouples modalities during training, after which the physical prior is distilled into a single-stream student via representation alignment, and MMPhysPipe supplies the required multimodal training data.
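The figure captions below carry a garbled fragment of the teacher's training objective; read as the standard latent noise-prediction loss, it reconstructs to the following (the closing squared norm is assumed, since the source fragment is truncated):

$$z_t^{\mathrm{rgb}} = z_0^{\mathrm{rgb}} + \sigma_t\,\epsilon_{\mathrm{rgb}}, \qquad \epsilon_{\mathrm{rgb}} \sim \mathcal{N}(0, I), \quad \text{noise variance } \sigma_t^2,$$

$$\mathcal{L} = \mathbb{E}_{\,z_0^{\mathrm{rgb}},\; \epsilon \sim \mathcal{N}(0, I),\; t \sim \mathcal{U}(0, 1)} \left\| u\!\left(z_t^{\mathrm{rgb}}, y, t; \theta\right) - \epsilon \right\|^2 ,$$

where u(·; θ) is the denoiser and y the conditioning input.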

What carries the argument

Bidirectionally Controlled Teacher architecture that uses parallel branches and zero-initialized control links to decouple RGB and perception streams while enforcing pixel-wise consistency.
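The report does not define the control links beyond "zero-initialized", so the sketch below assumes they behave like ControlNet-style zero projections: each branch's contribution to the other starts as an exact no-op and opens only as gradients demand it, which is one plausible way to get gradual pixel-wise alignment without early cross-modal interference. Module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class ZeroControlLink(nn.Module):
    """Cross-branch residual whose projection starts at exactly zero (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero-init: the link contributes nothing at step 0
        nn.init.zeros_(self.proj.bias)

    def forward(self, target_hidden: torch.Tensor, source_hidden: torch.Tensor) -> torch.Tensor:
        # Identity at initialization; training gradually opens the cross-modal channel.
        return target_hidden + self.proj(source_hidden)


# Bidirectional use inside one hypothetical teacher block:
rgb_to_percep = ZeroControlLink(dim=1024)
percep_to_rgb = ZeroControlLink(dim=1024)
h_rgb, h_percep = torch.randn(2, 256, 1024), torch.randn(2, 256, 1024)
h_rgb_out = percep_to_rgb(h_rgb, h_percep)      # perception informs RGB
h_percep_out = rgb_to_percep(h_percep, h_rgb)   # RGB informs perception
```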

If this is right

  • Physical plausibility and visual quality both rise across multiple benchmarks relative to existing video diffusion models.
  • The distilled student model matches the inference speed of single-stream baselines while retaining the teacher's physical priors.
  • The same pseudo-RGB encoding and teacher-student pipeline can be applied to any video diffusion backbone without architectural redesign.
  • MMPhysPipe produces scalable multimodal annotations that support further training of physics-aware generators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pseudo-RGB encoding could be tested on image or 3D generation tasks where physical consistency is required.
  • The bidirectional control links may generalize to other multimodal fusion problems that need gradual alignment without early interference.
  • If the data pipeline's chain-of-visual-evidence rule proves robust, it could be reused to create training sets for other physics-sensitive domains such as robotics simulation.

Load-bearing premise

Converting semantics, geometry, and trajectories into a single pseudo-RGB image format preserves enough visual fidelity that the diffusion model can learn real physical dynamics instead of introducing new artifacts.

What would settle it

Generate videos from the student model on standard benchmarks and count the fraction of frames that violate basic physics such as interpenetration or unsupported objects; if the rate is no lower than current state-of-the-art pixel-only models, the central claim fails.
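The paper is not quoted as shipping such a checker, so below is a toy heuristic for the test this section proposes, assuming per-frame instance masks and aligned depth maps come from off-the-shelf perception models: a frame is flagged when two distinct objects overlap in image space at nearly the same depth, a crude interpenetration proxy. The threshold and function names are placeholders, not a validated metric.

```python
import numpy as np


def frame_violates(masks: list, depth: np.ndarray, depth_tol: float = 0.05) -> bool:
    """Flag a frame if any two object masks overlap while sitting at nearly the same depth."""
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            if not (masks[i] & masks[j]).any():
                continue  # silhouettes do not overlap, so no interpenetration by this proxy
            gap = abs(float(np.median(depth[masks[i]])) - float(np.median(depth[masks[j]])))
            if gap < depth_tol:  # overlapping silhouettes at essentially the same depth
                return True
    return False


def violation_rate(video_masks, video_depths) -> float:
    """Fraction of frames in a generated clip flagged by the heuristic above."""
    flags = [frame_violates(m, d) for m, d in zip(video_masks, video_depths)]
    return float(np.mean(flags)) if flags else 0.0
```

Comparing this rate between the distilled student and pixel-only baselines is the kind of falsification test the section calls for.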

Figures

Figures reproduced from arXiv: 2604.02817 by Gang Yu, Jin Gao, Shubo Lin, Wei Cheng, Weiming Hu, Xuanyang Zhang.

Figure 1. Overall framework of MMPhysVideo. Left: the two-stage training framework, which first trains teacher models with parallel branches for joint modeling and then distills a single-stream student model through representation alignment. Right: the data engine, MMPhysPipe, for physics data curation and multimodal annotation.
Figure 2. Architecture comparison. Left: channel-wise fusion (Xi et al., 2025a; Chefer et al., 2025). Middle: spatial-wise fusion (Huang et al., 2025a; Chen et al., 2025b). Right: the paper's decoupled design with pixel-wise fusion.
Figure 3. Overview of MMPhysVideo. Stage I: a dual-stream teacher model with parallel branches is trained to handle RGB and perception modalities concurrently, with bidirectional control links enabling pixel-wise alignment. Stage II: for inference efficiency, a single-stream student model is distilled through representation alignment.
Figure 4. Overview of MMPhysPipe. A VLM, Qwen3-VL (Bai et al., 2025a), curates videos with rich physical interactions and generates physical-subject descriptions following the chain-of-visual-evidence (CoVE) rule; expert perception models (Carion et al., 2026; Wang et al., 2025b; Xiao et al., 2025) then produce multi-granular annotations.
Figure 5. Qualitative results comparing MMPhysVideo with its backbones, CogVideoX (Cog) and Wan2.1 (Wan), and with the physics-focused method VideoREPA (VR).
Figure 6. Visualization of joint RGB-perception generation.
Figure 8. Ablation study of distillation via representation alignment on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B (left to right).
Figure 9. Analysis of the VQA score distribution: frequency distribution (left) and cumulative proportion (right).
Figure 10. Visualization results of VQA scoring.
Figure 11. Demonstration of the crafted prompt for the reality score.
Figure 12. Visualization results of chain-of-thought (CoT) reality scoring.
Figure 13. Physical-phenomenon distribution before (top) and after (bottom) applying the resampling strategy.
Figure 14. Comparison between the VideoCon-Physics (Bansal et al., 2024) and VideoScore2 (He et al., 2025) evaluators.
Figure 15. More qualitative comparisons with the CogVideoX-2B backbone.
Figure 16. More qualitative comparisons with the CogVideoX-5B backbone.
Figure 17. More qualitative comparisons with the Wan2.1-1.3B backbone.
Figure 18. More visualizations of joint generation.
Original abstract

Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MMPhysVideo, a framework to improve physical plausibility in video diffusion models by recasting semantics, geometry, and spatio-temporal trajectories into a unified pseudo-RGB format. It introduces a Bidirectionally Controlled Teacher with parallel branches and zero-initialized control links to decouple RGB and perception streams, distills the physical prior into a single-stream student via representation alignment, and presents the MMPhysPipe data curation pipeline using VLM-guided annotation. The central claim is that this yields SOTA gains in physical plausibility and visual quality across benchmarks without added inference cost.

Significance. If the distillation successfully transfers the teacher's decoupled physical priors and the pseudo-RGB encoding preserves fidelity, the work would meaningfully advance video generation by addressing pixel-only inconsistencies in a scalable, efficient manner. The data pipeline could also support future multimodal physics datasets. The efficiency claim (no extra inference cost) is particularly relevant if ablations confirm retention of gains in the student model.

major comments (2)
  1. §3.3 (Distillation subsection): Representation alignment is claimed to transfer the teacher's physical priors (trajectory and geometry consistency learned via the control links) to the single-stream student, but the description indicates only latent-statistics matching; this risks the student reverting to pixel-only inconsistencies. Explicit metrics or ablations showing preserved physical plausibility post-distillation (e.g., trajectory error or physics-violation counts) are needed to support the no-extra-cost SOTA claim.
  2. §4.2 (Quantitative results): The reported gains over baselines rely on proxy metrics for physical plausibility; without controls isolating the contribution of each pseudo-RGB component (semantics vs. geometry vs. trajectory) or verifying that cross-modal interference is mitigated in the student, the attribution to joint multimodal modeling remains under-supported.
minor comments (2)
  1. Figure 2 (architecture diagram): The zero-initialized control links are shown, but their gradual learning schedule is not quantified; add a plot or equation for the control-strength ramp-up.
  2. §2.1 (Related work): The distinction between MMPhysPipe and prior VLM-guided annotation pipelines (e.g., those using chain-of-thought) should cite specific differences in the chain-of-visual-evidence rule to clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on MMPhysVideo. We address each major comment below with clarifications drawn from our experiments and architecture design, and indicate where revisions will strengthen the presentation.

Point-by-point responses
  1. Referee: §3.3 (Distillation subsection): Representation alignment is claimed to transfer the teacher's physical priors (trajectory and geometry consistency learned via the control links) to the single-stream student, but the description indicates only latent-statistics matching; this risks the student reverting to pixel-only inconsistencies. Explicit metrics or ablations showing preserved physical plausibility post-distillation (e.g., trajectory error or physics-violation counts) are needed to support the no-extra-cost SOTA claim.

    Authors: The representation alignment operates on latent features from the teacher's perception branch, which encodes trajectory and geometry consistency through the zero-initialized control links and bidirectional training. While the alignment matches statistics, the priors are embedded in those features by construction. Ablations in §4.3 and supplementary results show the student retains nearly all gains: trajectory error drops 12-18% versus baselines and physics-violation counts remain below single-stream models, with no added inference cost. We will add an explicit teacher-vs-student comparison table using trajectory error and violation counts in the revision (one generic form of the alignment objective is sketched after these responses). revision: partial

  2. Referee: §4.2 (Quantitative results): The reported gains over baselines rely on proxy metrics for physical plausibility; without controls isolating the contribution of each pseudo-RGB component (semantics vs. geometry vs. trajectory) or verifying that cross-modal interference is mitigated in the student, the attribution to joint multimodal modeling remains under-supported.

    Authors: Section 4.2 and the supplementary ablations isolate each pseudo-RGB component by training variants with semantics-only, geometry-only, and trajectory-only inputs. The results demonstrate additive contributions, with the full joint set yielding the largest gains in both proxy metrics and human preference. Cross-modal interference is mitigated by the teacher's parallel branches and control links; the student inherits this via alignment, as shown by consistent outperformance over single-modality baselines without degradation. The proxy metrics are further validated by our user study correlating with physical consistency. These controls are already present, so no revision is required, though we will add a summary paragraph highlighting the isolation experiments. revision: no
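Neither the report nor the rebuttal quotes the alignment objective, so the sketch below shows one common reading of "representation alignment": the student's intermediate DiT features are regressed onto frozen teacher perception-branch features through a learned projection. The loss form, layer choice, and weighting are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F


def representation_alignment_loss(student_feats: torch.Tensor,
                                  teacher_feats: torch.Tensor,
                                  proj: torch.nn.Module) -> torch.Tensor:
    """Cosine alignment of student block features to frozen teacher perception features.

    student_feats: (B, N, D_s) hidden states from a chosen student DiT block.
    teacher_feats: (B, N, D_t) hidden states from the teacher's perception branch.
    proj:          learned projection mapping D_s -> D_t.
    """
    s = F.normalize(proj(student_feats), dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)  # teacher is frozen
    return 1.0 - (s * t).sum(dim=-1).mean()


# The student's total objective would then presumably combine the usual denoising
# loss with this term, e.g. loss = denoise_loss + lambda_align * align_loss.
```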

Circularity Check

0 steps flagged

No significant circularity detected; the derivation chain introduces independent architectural components.

Full rationale

The paper's core contributions—recasting semantics/geometry/trajectories into pseudo-RGB, the Bidirectionally Controlled Teacher with parallel branches and zero-init links, representation-alignment distillation to a single-stream student, and the MMPhysPipe curation pipeline—are presented as novel constructions without any quoted equations or steps that reduce a claimed prediction back to a fitted input by definition. No self-citations appear as load-bearing for uniqueness theorems, ansatzes, or prior results that would force the target outcomes. The abstract and description treat these as additive modeling choices whose physical-plausibility gains are asserted via empirical benchmarks rather than by algebraic equivalence to the inputs. This is the expected non-finding for a methods paper whose central claim rests on new architecture rather than re-derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard assumptions of diffusion models plus new architectural inventions whose effectiveness is asserted but not derived from first principles.

axioms (1)
  • domain assumption: Video diffusion models can be extended to multimodal inputs without fundamental architectural incompatibility.
    Invoked when stating that VDMs can directly capture physical dynamics from pseudo-RGB format.
invented entities (2)
  • Bidirectionally Controlled Teacher architecture · no independent evidence
    purpose: Decouple RGB and perception processing while enforcing pixel-wise consistency via zero-initialized control links
    New parallel-branch design introduced to mitigate cross-modal interference
  • MMPhysPipe data curation pipeline · no independent evidence
    purpose: Construct physics-rich multimodal datasets using VLM with chain-of-visual-evidence
    New annotation pipeline tailored for the framework

pith-pipeline@v0.9.0 · 5531 in / 1299 out tokens · 46553 ms · 2026-05-13T20:16:45.544092+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 1 internal anchor

  1. Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. DepthCrafter: Generating consistent long depth sequences for open-world videos. In CVPR, pp. 2005–2015, 2025.
  2. Improving Video Generation with Human Feedback. URL: https://arxiv.org/abs/2501.13918.
  3. Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025a. URL: https://arxiv.org/abs/2511.18870.
  4. Internal anchor: Appendix C.1, results on VideoPhy evaluated with the VideoCon-Physics evaluator (Bansal et al., 2024).