Pith · machine review for the scientific record

arxiv: 2604.07348 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.AI · cs.GR · cs.LG · cs.RO

Recognition: no theorem link

MoRight: Motion Control Done Right

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:34 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR · cs.LG · cs.RO
keywords: video generation · motion control · disentangled control · motion causality · cross-view attention · active-passive decomposition · generative models · interaction awareness

The pith

MoRight separates object motion from camera viewpoint and models causal interactions in video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MoRight to generate videos where user actions drive object motions under chosen viewpoints while ensuring realistic consequences for other objects. It does this by defining object motion in a fixed canonical view and using temporal cross-view attention to adapt it to new camera angles. Motion is split into active user inputs and passive reactions, allowing the model to learn causality directly from examples. This enables both predicting what happens after an action and figuring out what action would lead to a desired outcome, all with free camera movement. If successful, it improves controllability and physical realism over methods that entangle controls or ignore causality.
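
To make the two inference modes concrete, a minimal interface sketch follows; the names (MoRightModel, forward_reason, inverse_reason) and the array shapes are editorial assumptions, not the paper's API. The point is that object motion and camera control enter as separate inputs: first-frame tracks defined in the canonical view and a per-frame camera pose sequence.

```python
# Hypothetical interface sketch of the two inference modes described above.
# MoRightModel, forward_reason, and inverse_reason are illustrative names,
# not the paper's API; shapes are placeholders.
import numpy as np

class MoRightModel:
    def forward_reason(self, image, active_tracks, camera_poses):
        """Given user-driven (active) motion, generate a video whose passive
        objects react plausibly under the requested camera trajectory."""
        raise NotImplementedError  # stands in for the learned generator

    def inverse_reason(self, image, passive_tracks, camera_poses):
        """Given a desired passive outcome, recover a plausible driving
        (active) motion and render the corresponding video."""
        raise NotImplementedError

# Inputs stay disentangled: object motion is specified as first-frame tracks
# in the canonical view, the camera as a separate per-frame pose sequence.
image = np.zeros((480, 832, 3), dtype=np.uint8)                  # single input frame
active_tracks = np.zeros((32, 49, 2), dtype=np.float32)          # tracks x frames x (x, y)
camera_poses = np.tile(np.eye(4, dtype=np.float32), (49, 1, 1))  # camera-to-world per frame
```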

Core claim

We introduce MoRight, a unified framework for motion-controlled video generation that specifies object motion in a canonical static view, transfers it to arbitrary target viewpoints using temporal cross-view attention for disentangled control, and decomposes motion into active and passive components to learn causality from data, supporting both forward prediction of consequences and inverse recovery of actions.

What carries the argument

Temporal cross-view attention for transferring canonical object motion to target views, together with active-passive motion decomposition for causality learning.
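
The sketch below illustrates, under stated assumptions, how such a per-frame cross-view transfer could be wired: queries come from the target-view stream that carries the chosen camera, while keys and values are motion tokens defined in the fixed canonical view. The module is an editorial illustration built from standard multi-head attention, not the authors' architecture.

```python
# Minimal sketch (assumed, not the paper's implementation): per-frame
# cross-attention that lets target-view tokens read canonical-view motion tokens.
import torch
import torch.nn as nn

class TemporalCrossViewAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_tokens, canonical_tokens):
        # target_tokens:    (B, T, N_tgt, D) tokens of the target-view stream
        # canonical_tokens: (B, T, N_can, D) motion tokens in the canonical view
        B, T, N, D = target_tokens.shape
        q = target_tokens.reshape(B * T, N, D)
        kv = canonical_tokens.reshape(B * T, -1, D)
        out, _ = self.attn(q, kv, kv)  # each frame attends to same-time canonical tokens
        return self.norm(out + q).reshape(B, T, N, D)  # residual + norm

x = torch.randn(2, 16, 64, 256)          # target-view tokens
m = torch.randn(2, 16, 32, 256)          # canonical-view motion tokens
y = TemporalCrossViewAttention()(x, m)   # -> (2, 16, 64, 256)
```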

If this is right

  • Users gain independent control over object movements and camera positions in generated videos.
  • The system predicts coherent passive reactions from user-specified active motions.
  • Desired passive outcomes can be used to infer plausible active driving actions.
  • Performance improves on benchmarks for video quality, motion accuracy, and interaction realism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This separation could support interactive tools where users sketch actions and receive physically consistent scene updates in real time.
  • The inverse reasoning path might help in animation pipelines by letting artists specify end effects and recovering the required inputs.
  • If the causality learning generalizes, the approach could transfer to simulation domains needing forward prediction of multi-object dynamics.

Load-bearing premise

The training data contains enough examples of causal object interactions so the model can learn to separate active from passive motions without supervision, and the attention transfer works without artifacts even for large viewpoint changes.

What would settle it

Provide an active motion such as one object pushing another, then verify whether the model produces the expected passive reactions like collisions or falls when the camera viewpoint changes substantially from the canonical one.
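
A hedged sketch of that probe, assuming the hypothetical forward_reason interface from the earlier sketch and an externally supplied passive-object tracker (the supplied text does not specify an evaluation harness):

```python
# Hedged sketch of the probe described above, not an evaluation protocol from the
# paper: generate the same push under the canonical camera and under a large
# viewpoint change, then check that the passive object reacts in both renderings.
import numpy as np

def causal_probe(model, image, push_tracks, canonical_pose, rotated_pose,
                 tracker, min_motion_px: float = 5.0) -> bool:
    video_canonical = model.forward_reason(image, push_tracks, canonical_pose)
    video_rotated = model.forward_reason(image, push_tracks, rotated_pose)

    # `tracker` is any caller-provided function returning the passive object's
    # centroid per frame, e.g. a point tracker restricted to the passive region.
    traj_canonical = np.asarray(tracker(video_canonical))
    traj_rotated = np.asarray(tracker(video_rotated))

    reacted_canonical = np.linalg.norm(traj_canonical[-1] - traj_canonical[0]) > min_motion_px
    reacted_rotated = np.linalg.norm(traj_rotated[-1] - traj_rotated[0]) > min_motion_px
    return reacted_canonical and reacted_rotated
```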

Figures

Figures reproduced from arXiv: 2604.07348 by Huan Ling, Jun Gao, Sanja Fidler, Saurabh Gupta, Shaowei Liu, Shenlong Wang, Tianchang Shen, Xuanchi Ren.

Figure 1: Given a single input image, our method enables controllable interactive motion generation with …
Figure 2: Model architecture. Our model adopts a dual-stream architecture with shared weights to disentangle object motion from camera motion. The canonical stream encodes motion trajectories using a track encoder and learns motion in a fixed canonical view. The target stream encodes camera pose signals through a camera encoder. The resulting motion and camera conditions are injected into every attention block of th…
Figure 3: Active vs. passive motion. The active object (hand) initiates the action, while the passive object (cloth) responds. Disentangling the camera from motion alone is insufficient for realistic interactions: when a hand pushes a cup, the cup must slide; when a ball strikes a stack of blocks, the blocks must scatter. We term this as motion causality: the ability to reason plausible consequences from the given …
Figure 4: Data curation pipeline. Foundation models [30, 36, 59] extract depth, camera poses, and tracks from raw videos. A VLM [3] segments tracks into active/passive regions. We further optionally use a video-to-video model [21] to generate paired videos with the same object motion but different camera motions.
Figure 5: Disentangled camera–object control. MoRight enables independent control of object motion and camera viewpoint. Rows 1-3 fix the camera and vary object motion (rows 1-2: forward reasoning; row 3: inverse reasoning), while rows 4-6 fix object motion and vary camera motion.
Figure 6: Qualitative comparison of ATI [68], WanMove [15], and MoRight on interactive motion generation with camera control. All methods use the same input image. ATI and WanMove rely on pixel-aligned per-frame tracks (top-left), which entangle camera and object motion and require privileged future tracks. In contrast, MoRight uses only reprojected first-frame tracks. The first two rows show active motion reasoning…
Figure 7: Causal interaction reasoning. In the first 3 rows, we provide active motion (e.g., hand movement) as input, and the model infers the resulting passive motion (e.g., cloth movement). In the last 3 rows, we provide passive motion (e.g., ball movement), and the model infers the corresponding active motion (e.g., human movement).
Figure 8: Human perceptual evaluation. From 330 responses by 11 participants, our method is preferred across controllability, motion realism, and photorealism, outperforming ATI [68] and WanMove [15], which rely on privileged 3D tracks but lack interaction reasoning.
Figure 9: Limitation analysis. Input tracks are overlaid on the first frame as in previous figures. (1) Incorrect interaction reasoning may lead to implausible outcomes (two kabobs merging). (2) Unnatural motion can occur when input tracks become temporally sparse due to occlusion (hand example). (3) Physically unrealistic dynamics may appear, such as objects disappearing during motion (soccer ball). (4) Hallucinate…
Figure 10: Prompt used for active and passive object identification for Qwen3 …
Figure 11: Interactive demo interface. Our system enables users to control both object and camera motion from a single image. Users draw trajectories on the first frame to specify object motion (active or passive), either by moving a selected region using keypoint trajectories or by defining fine-grained motion paths for detailed control.
Figure 12: Human perceptual evaluation interface. Given an input image, object trajectories, and a target camera motion, participants evaluate generated videos under three criteria: Controllability (matching object tracks and camera motion), Motion Realism (physically plausible interactions and scene responses), and Photorealism (overall visual quality). For each criterion, participants select the best video (multip…
Figure 13: Causal interaction reasoning. Input tracks are shown in color and overlaid on the generated static reference-view video. The tracks represent user actions (active) or passive trajectories. Given these inputs, our model either predicts plausible consequences (forward reasoning) or recovers feasible driving actions that produce the desired outcomes (inverse reasoning, last row).
Figure 14: Additional controllable generation-1. Object motion trajectories are overlaid on the input image. For each video, we show different camera and object motion control. Each group shares the same object motion but uses different camera motions. Minor variations under the same object motion but different camera motions arise from the stochastic nature of interaction generation.
Figure 15: Additional qualitative comparison with ATI [68], WanMove [15], and MoRight. ATI and WanMove rely on privileged 3D trajectories (with depth) projected to pixel-aligned per-frame tracks and take full interaction trajectories (active and passive) as input. In contrast, MoRight uses only first-frame active tracks without privileged information and infers plausible interactions.
Original abstract

Generating motion-controlled videos--where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints--demands two capabilities: (1) disentangled motion control, allowing users to separately control the object motion and adjust camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motion. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static-view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and MoRight predicts consequences (forward reasoning), or specify desired passive outcomes and MoRight recovers plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MoRight, a unified framework for motion-controlled video generation. Object motion is defined in a canonical static view and transferred to arbitrary target viewpoints using temporal cross-view attention, achieving disentangled control over object motion and camera. Motion is further decomposed into active (user-driven) and passive (consequence) components, with the model trained to learn motion causality from data. This enables forward reasoning (predict consequences from active motion) and inverse reasoning (recover actions from desired passive outcomes) at inference, while allowing free viewpoint adjustment. The work claims state-of-the-art results on three benchmarks for generation quality, motion controllability, and interaction awareness.

Significance. If the disentanglement via cross-view attention and the unsupervised active-passive decomposition hold up under rigorous validation, the approach could advance controllable video synthesis by enabling more physically plausible interactions and bidirectional reasoning. The inverse mode, in particular, would be a useful capability for applications in animation and simulation if the causality modeling proves robust rather than correlational.

major comments (2)
  1. [Abstract / Method] (no section or equation numbers are given in the supplied text) The central claim that active-passive decomposition is learned from data to capture motion causality lacks any mention of an auxiliary loss, architectural separation, or inductive bias that would penalize non-causal mappings. In the absence of such mechanisms, the training objective can be satisfied by learning kinematic correlations or viewpoint-dependent patterns, directly undermining the forward and inverse reasoning modes asserted at inference.
  2. [Experiments] (referenced but not detailed in the supplied text) The abstract asserts SOTA performance on three benchmarks for quality, controllability, and interaction awareness, yet provides no quantitative metrics, ablation studies on the active-passive split, or error analysis comparing against baselines that lack explicit causality modeling. This makes it impossible to assess whether the claimed gains are attributable to the proposed decomposition or to other factors.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief reference to the specific quantitative improvements (e.g., percentage gains on key metrics) rather than a high-level SOTA claim.
  2. [Method] Notation for active/passive components and the temporal cross-view attention mechanism should be introduced with equations or pseudocode in the method section for clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important areas for clarification on the causality modeling and for strengthening the experimental validation. We address each major comment point by point below. We plan to incorporate revisions that improve clarity without altering the core technical contributions.

Point-by-point responses
  1. Referee: [Abstract / Method] The central claim that active-passive decomposition is learned from data to capture motion causality lacks any mention of an auxiliary loss, architectural separation, or inductive bias that would penalize non-causal mappings. In the absence of such mechanisms, the training objective can be satisfied by learning kinematic correlations or viewpoint-dependent patterns, directly undermining the forward and inverse reasoning modes asserted at inference.

    Authors: We appreciate this observation on the need for explicit mechanisms. The method implements the active-passive decomposition via separate prediction branches for user-driven (active) and consequence (passive) motions, with the overall training objective combining reconstruction losses on both components and a temporal consistency term that requires passive motions to be predictable from active ones under the canonical-view transfer. This architectural separation and consistency constraint serve as the inductive bias. We acknowledge that the abstract and high-level method overview do not sufficiently detail these elements, which could lead to the interpretation raised. We will revise the method section to explicitly describe the loss formulation and branch separation, including how they discourage purely correlational solutions and support the bidirectional reasoning at inference (an illustrative sketch of this objective structure follows these responses). revision: yes

  2. Referee: [Experiments] The abstract asserts SOTA performance on three benchmarks for quality, controllability, and interaction awareness, yet provides no quantitative metrics, ablation studies on the active-passive split, or error analysis comparing against baselines that lack explicit causality modeling. This makes it impossible to assess whether the claimed gains are attributable to the proposed decomposition or to other factors.

    Authors: We agree that additional experimental details are warranted to substantiate the claims. The full experiments section reports quantitative results (FID, FVD, motion accuracy, and interaction metrics) across the three benchmarks with comparisons to prior methods. However, we did not include dedicated ablations isolating the active-passive split or error breakdowns versus non-causal baselines. We will add these ablations and analyses in the revised version, including quantitative comparisons that isolate the contribution of the decomposition to controllability and interaction awareness. revision: yes
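
For readability, here is an illustrative rendering of the objective structure the rebuttal describes in response 1; the symbols and weighting are editorial assumptions, since the paper's actual loss is not given in the supplied text.

```latex
% Illustrative only; symbols and weighting are assumptions, not the paper's formulation.
\mathcal{L} =
\underbrace{\bigl\|\hat{\tau}^{\text{act}} - \tau^{\text{act}}\bigr\|^2}_{\text{active reconstruction}}
+ \underbrace{\bigl\|\hat{\tau}^{\text{pas}} - \tau^{\text{pas}}\bigr\|^2}_{\text{passive reconstruction}}
+ \lambda\, \underbrace{\bigl\|\hat{\tau}^{\text{pas}} - g_\theta\bigl(\mathcal{T}_{c \to t}(\tau^{\text{act}})\bigr)\bigr\|^2}_{\text{temporal consistency under canonical-view transfer}}
```

Here τ^act and τ^pas denote active and passive motion, T_{c→t} the canonical-to-target-view transfer, and g_θ a learned map from transferred active motion to passive motion.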

Circularity Check

0 steps flagged

No circularity: claims rest on data-driven learning without self-referential reductions

full rationale

The paper presents MoRight as a learned neural framework that decomposes motion into active/passive components and uses temporal cross-view attention, with all capabilities emerging from training on video data. No equations, derivations, or parameter-fitting steps are described that would make any 'prediction' or 'result' equivalent to its inputs by construction. No self-citations are invoked as load-bearing premises, and the central claims (disentangled control, forward/inverse reasoning) are positioned as outcomes of standard supervised or self-supervised training rather than tautological redefinitions. This is the expected non-finding for a data-driven generative model without explicit mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed from abstract only; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes that video data encodes learnable causal structure and that cross-view attention can faithfully transport motion signals.

pith-pipeline@v0.9.0 · 5544 in / 1348 out tokens · 44797 ms · 2026-05-10T17:34:08.321477+00:00 · methodology


Reference graph

Works this paper leans on

91 extracted references · 38 canonical work pages · 16 internal anchors

  1. [1]

    J. Bai, M. Xia, X. Fu, X. Wang, L. Mu, J. Cao, Z. Liu, H. Hu, X. Bai, P. Wan, et al. Recammaster: Camera-controlled generative rendering from a single video.arXiv preprint arXiv:2503.11647, 2025. 4, 8

  2. [2]

    J. Bai, M. Xia, X. Wang, Z. Yuan, X. Fu, Z. Liu, H. Hu, P. Wan, and D. Zhang. Syncammaster: Synchronizing multi-camera video generation from diverse viewpoints.Proc. ICLR, 2025. 4, 7, 8

  3. [3]

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, Q....

  4. [4]

    H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024. 8, 18

  5. [5]

    H. Bansal, C. Peng, Y. Bitton, R. Goldenberg, A. Grover, and K.-W. Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025. 18

  6. [6]

    A. Bar, G. Zhou, D. Tran, T. Darrell, and Y. LeCun. Navigation world models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025. 2

  7. [7]

    A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023. 2

  8. [8]

    T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh. Video generation models as world simulators.OpenAI technical reports, 2024. 2

  9. [9]

    Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669,

  10. [10]

    R. Burgert, Y. Xu, W. Xian, O. Pilarski, P. Clausen, M. He, L. Ma, Y. Deng, L. Li, M. Mousavi, et al. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13–23, 2025. 2

  11. [11]

    B. Chen, H. Jiang, S. Liu, S. Gupta, Y. Li, H. Zhao, and S. Wang. Physgen3d: Crafting a miniature interactive world from a single image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6178–6189, 2025. 2, 3

  12. [12]

    H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models.arXiv preprint arXiv:2401.09047, 2024. 2

  13. [13]

    T.-S. Chen, A. Siarohin, W. Menapace, E. Deyneka, H.-w. Chao, B. E. Jeon, Y. Fang, H.-Y. Lee, J. Ren, M.-H. Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024. 8

  14. [14]

    Y. Chen, Y. Men, Y. Yao, M. Cui, and L. Bo. Perception-as-control: Fine-grained controllable image animation with 3d-aware motion representation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14380–14389, 2025. 3

  15. [15]

    R. Chu, Y. He, Z. Chen, S. Zhang, X. Xu, B. Xia, D. Wang, H. Yi, X. Liu, H. Zhao, et al. Wan-move: Motion-controllable video generation via latent trajectory guidance.arXiv preprint arXiv:2512.08765, 2025. 2, 3, 8, 9, 10, 11, 13, 21, 22

  16. [16]

    T. Cosmos. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. 4

  17. [17]

    C. Doersch, Y. Yang, M. Vecerik, D. Gokay, A. Gupta, Y. Aytar, J. Carreira, and A. Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10061–10072, 2023. 3

  18. [18]

    S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.Neural networks, 107:3–11, 2018. 16

  19. [19]

    C. Finn and S. Levine. Deep visual foresight for planning robot motion. In2017 IEEE international conference on robotics and automation (ICRA), pages 2786–2793. IEEE, 2017. 6

  20. [21]

    X. Fu, S. Tang, M. Shi, X. Liu, J. Gu, M.-Y. Liu, D. Lin, and C.-H. Lin. Plenoptic video generation.arXiv preprint arXiv:2601.05239, 2026. 6

  21. [22]

    Q. Gao, Q. Xu, Z. Cao, B. Mildenhall, W. Ma, L. Chen, D. Tang, and U. Neumann. Gaussianflow: Splatting Gaussian dynamics for 4D content creation.arXiv preprint arXiv:2403.12365, 2024. 2

  22. [23]

    S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, Q. Ma, S. Nah, L. Magne, J. Xiang, Y. Xie, R. Zheng, D. Niu, Y. L. Tan, K. Zentner, G. Kurian, S. Indupuru, P. Jannaty, J. Gu, J. Zhang, J. Malik, P. Abbeel, M.-Y. Liu, Y. Zhu, J. Jang, and L. J. Fan. Dreamdojo: A generalist robot world model from large-scale human...

  23. [24]

    D. Geng, C. Herrmann, J. Hur, F. Cole, S. Zhang, T. Pfaff, T. Lopez-Guevara, C. Doersch, Y. Aytar, M. Rubinstein, C. Sun, O. Wang, A. Owens, and D. Sun. Motion prompting: Controlling video generation with motion trajectories. arXiv preprint arXiv:2412.02700, 2024. 2, 3, 5, 8, 9, 11

  24. [25]

    N. Gillman, C. Herrmann, M. Freeman, D. Aggarwal, E. Luo, D. Sun, and C. Sun. Force prompting: Video generation models can learn and generalize physics-based control signals.arXiv preprint arXiv:2505.19386, 2025. 3

  25. [26]

    Z. Gu, R. Yan, J. Lu, P. Li, Z. Dou, C. Si, Z. Dong, Q. Liu, C. Lin, Z. Liu, et al. Diffusion as shader: 3d-aware video diffusion for versatile video generation control. InSIGGRAPH, 2025. 3

  26. [27]

    D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019. 2

  27. [28]

    D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019. 6

  28. [29]

    D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023. 2

  29. [30]

    A. W. Harley, Y. You, X. Sun, Y. Zheng, N. Raghuraman, Y. Gu, S. Liang, W.-H. Chu, A. Dave, S. You, et al. Alltracker: Efficient dense point tracking at high resolution. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5253–5262, 2025. 3, 6, 7

  30. [31]

    H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024. 8

  31. [32]

    X. He, C. Peng, Z. Liu, B. Wang, Y. Zhang, Q. Cui, F. Kang, B. Jiang, M. An, Y. Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025. 2

  32. [33]

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InProc. NeurIPS, 2017. 8

  33. [34]

    W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. 2

  34. [35]

    Y. Hong, Y. Mei, C. Ge, Y. Xu, Y. Zhou, S. Bi, Y. Hold-Geoffroy, M. Roberts, M. Fisher, E. Shechtman, et al. Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040, 2025. 2

  35. [36]

    J. Huang, Q. Zhou, H. Rabeti, A. Korovko, H. Ling, X. Ren, T. Shen, J. Gao, D. Slepichev, C.-H. Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025. 3, 6, 7, 8, 16

  36. [37]

    N. Huang, W. Zheng, C. Xu, K. Keutzer, S. Zhang, A. Kanazawa, and Q. Wang. Segment any motion in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3406–3416, 2025. 3

  37. [38]

    W. Jin, Q. Dai, C. Luo, S.-H. Baek, and S. Cho. Flovd: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis. InProc. CVPR, 2025. 3

  38. [39]

    N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker: It is better to track together. InProc. ECCV, 2024. 3

  39. [40]

    J. Kopf, X. Rong, and J.-B. Huang. Robust consistent video depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1621, 2021. 3

  40. [41]

    Q. Li, Z. Xing, R. Wang, H. Zhang, Q. Dai, and Z. Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12112–12123,

  41. [42]

    Z. Li, S. Niklaus, N. Snavely, and O. Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. InCVPR, 2021. 3

  42. [43]

    Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V. Ye, A. Kanazawa, A. Holynski, and N. Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10486–10496, 2025. 3

  43. [44]

    Z. Li, H.-X. Yu, W. Liu, Y. Yang, C. Herrmann, G. Wetzstein, and J. Wu. Wonderplay: Dynamic 3d scene generation from a single image and actions. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9080–9090, 2025. 2, 3

  44. [45]

    L. Lian, B. Shi, A. Yala, T. Darrell, and B. Li. Llm-grounded video diffusion models.arXiv preprint arXiv:2309.17444,

  45. [46]

    F. Liang, B. Wu, J. Wang, L. Yu, K. Li, Y. Zhao, I. Misra, J.-B. Huang, P. Zhang, P. Vajda, et al. Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8207–8216, 2024. 3

  46. [47]

    H. Lin, S. Chen, J. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025. 5

  47. [48]

    Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. 4

  48. [49]

    S. Liu, C. Guo, B. Zhou, and J. Wang. Ponimator: Unfolding interactive pose for versatile human-human interaction animation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12068–12077, 2025. 2, 3

  49. [50]

    S. Liu, Z. Ren, S. Gupta, and S. Wang. Physgen: Rigid-body physics-grounded image-to-video generation. InEuropean Conference on Computer Vision, pages 360–378. Springer, 2024. 2, 3

  50. [51]

    S. Liu, D. Y. Yao, S. Gupta, and S. Wang. Visual sync: Multi-camera synchronization via cross-view object motion. arXiv preprint arXiv:2512.02017, 2025. 3

  51. [52]

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 8

  52. [53]

    J. Lv, Y. Huang, M. Yan, J. Huang, J. Liu, Y. Liu, Y. Wen, X. Chen, and S. Chen. Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1430–1440, 2024. 3

  53. [54]

    W.-D. K. Ma, J. P. Lewis, and W. B. Kleijn. Trailblazer: Trajectory control for diffusion-based video generation. arXiv preprint arXiv:2401.00896, 2023. 3

  54. [55]

    A. Montanaro, L. Savant Aira, E. Aiello, D. Valsesia, and E. Magli. Motioncraft: Physics-based zero-shot video generation.Advances in Neural Information Processing Systems, 37:123155–123181, 2024. 3

  55. [56]

    M. Niu, X. Cun, X. Wang, Y. Zhang, Y. Shan, and Y. Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. InEuropean conference on computer vision, pages 111–128. Springer, 2024. 3

  56. [57]

    C. Pan, B. Yaman, T. Nesti, A. Mallik, A. G. Allievi, S. Velipasalar, and L. Ren. Vlp: Vision language planning for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14760–14769, 2024. 3

  57. [58]

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. 4

  58. [59]

    N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 6, 7, 16

  59. [60]

    X. Ren, T. Shen, J. Huang, H. Ling, Y. Lu, M. Nimier-David, T. Müller, A. Keller, S. Fidler, and J. Gao. Gen3c: 3d- informed world-consistent video generation with precise camera control. InCVPR, pages 6121–6132, 2025. 8, 9, 16

  60. [61]

    C. Rockwell, J. Tung, T.-Y. Lin, M.-Y. Liu, D. F. Fouhey, and C.-H. Lin. Dynamic camera poses and where to find them. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12444–12455, 2025. 8, 9

  61. [62]

    X. Shi, Z. Huang, F.-Y. Wang, W. Bian, D. Li, Y. Zhang, M. Zhang, K. C. Cheung, S. See, H. Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 3

  62. [63]

    A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe. First order motion model for image animation. In NeurIPS, 2019. 3

  63. [64]

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 16

  64. [65]

    S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In CVPR, 2018. 3

  65. [66]

    T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717, 2018. 8

  66. [67]

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 2, 4, 8, 9, 16

  67. [68]

    A. Wang, H. Huang, Z. Fang, Y. Yang, and C. Ma. Ati: Any trajectory instruction for controllable video generation. arXiv preprint, arXiv:2505.22944, 2025. 2, 8, 9, 10, 11, 13, 21, 22

  68. [69]

    J. Wang, A. Ma, K. Cao, J. Zheng, Z. Zhang, J. Feng, S. Liu, Y. Ma, B. Cheng, D. Leng, et al. Wisa: World simulator assistant for physics-aware text-to-video generation.arXiv preprint arXiv:2503.08153, 2025. 8, 11

  69. [70]

    J. Wang, Y. Zhang, J. Zou, Y. Zeng, G. Wei, L. Yuan, and H. Li. Boximator: Generating rich and controllable motions for video synthesis.arXiv preprint arXiv:2402.01566, 2024. 2, 3

  70. [71]

    R. Wang, S. Xu, C. Dai, J. Xiang, Y. Deng, X. Tong, and J. Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision, 2024. 17

  71. [72]

    X. Wang, Z. Zhu, G. Huang, X. Chen, J. Zhu, and J. Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. InEuropean conference on computer vision, pages 55–72. Springer, 2024. 2

  72. [73]

    Y. Wang, P. Bilinski, F. Bremond, and A. Dantcheva. G3AN: Disentangling appearance and motion for video generation. In CVPR, 2020. 3

  73. [74]

    Y. Wang, F. Bremond, and A. Dantcheva. Inmodegan: Interpretable motion decomposition generative adversarial network for video generation.arXiv preprint arXiv:2101.03049, 2021. 3

  74. [75]

    Z. Wang, Z. Yuan, X. Wang, T. Chen, M. Xia, P. Luo, and Y. Shan. Motionctrl: A unified and flexible motion controller for video generation. InSIGGRAPH, 2024. 2, 3

  75. [76]

    T.-H. Wu, L. Lian, J. E. Gonzalez, B. Li, and T. Darrell. Self-correcting llm-controlled diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6327–6336, 2024. 3

  76. [77]

    W. Wu, Z. Li, Y. Gu, R. Zhao, Y. He, D. J. Zhang, M. Z. Shou, Y. Li, T. Gao, and D. Zhang. Draganything: Motion control for anything using entity representation. InProc. ECCV, 2024. 2, 3

  77. [78]

    X. Wu, D. Paschalidou, J. Gao, A. Torralba, L. Leal-Taixé, O. Russakovsky, S. Fidler, and J. Lorraine. Where is motion from? scalable motion attribution for video generation models. In1st Workshop on Reliable and Interactive World Model in Computer Vision Non Archival, 2026. 3

  78. [79]

    J. Xing, L. Mai, C. Ham, J. Huang, A. Mahapatra, C.-W. Fu, T.-T. Wong, and F. Liu. Motioncanvas: Cinematic shot design with controllable image-to-video generation. InSIGGRAPH, 2025. 3

  79. [80]

    M. Yang, Y. Du, K. Ghasemipour, J. Tompson, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 1(2):6, 2023. 2

  80. [81]

    X. Yang, B. Li, Y. Zhang, Z. Yin, L. Bai, L. Ma, Z. Wang, J. Cai, T.-T. Wong, H. Lu, et al. Vlipp: Towards physically plausible video generation with vision and language informed physical prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12360–12370, 2025. 3

Showing first 80 references.