OrthoMotion: Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention

Zijie Meng

arxiv: 2606.22835 · v1 · pith:2YQSWGCDnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-26 09:30 UTC · model grok-4.3

keywords: motion disentanglement · video generation · attention mechanism · camera control · subject motion · orthogonal subspaces · optical flow · rotary embeddings

The pith

OrthoMotion routes camera motion to a rotation of rotary embeddings and subject motion to gated cross-attention so a regularizer can force the two response subspaces to orthogonality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that camera and subject motion cannot be separated from 2D image evidence alone because both produce optical flow scaled by the same inverse depth. It reframes the problem as one of attention operator design rather than network architecture. Camera motion is assigned to a norm-preserving rotation of the position embedding phase while subject motion is assigned to a gated value update in cross-attention. Because these two operations act as algebraically complementary parts of an affine transformation on tokens, a simple regularizer can drive their output subspaces to be orthogonal. The result is independent control of camera and subject with measurable reduction in interference across backbones.
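The shared inverse-depth scaling can be illustrated with a pinhole camera. The sketch below is not the paper's proof; it is a toy check, under an assumed unit focal length and pure translation, that a camera translation `t` and an object translation `-t` move a single point identically in the image, and that the lateral displacement scales as 1/Z. The function names are hypothetical.

```python
import numpy as np

def project(P, f=1.0):
    """Pinhole projection of a 3D point P = (X, Y, Z) to image coordinates."""
    X, Y, Z = P
    return np.array([f * X / Z, f * Y / Z])

def flow_from_camera_translation(P, t, f=1.0):
    """Image displacement of a static point when the camera translates by t."""
    return project(P - t, f) - project(P, f)

def flow_from_object_translation(P, d, f=1.0):
    """Image displacement when the point moves by d and the camera stays put."""
    return project(P + d, f) - project(P, f)

P = np.array([0.5, -0.2, 4.0])   # point at depth Z = 4
t = np.array([0.1, 0.0, 0.0])    # lateral translation

# Camera moving by t and object moving by -t give the same 2D displacement.
flow_cam = flow_from_camera_translation(P, t)
flow_obj = flow_from_object_translation(P, -t)
same_flow = np.allclose(flow_cam, flow_obj)

# Halving the depth doubles the lateral displacement (1/Z scaling).
near = flow_from_camera_translation(np.array([0.5, -0.2, 2.0]), t)
far = flow_from_camera_translation(np.array([0.5, -0.2, 4.0]), t)
depth_ratio = near[0] / far[0]

print(same_flow, depth_ratio)
```

From the 2D displacement of one point alone, the two causes cannot be told apart in this toy, which is the sense in which the split is called non-identifiable.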

Core claim

The entanglement of camera-induced and subject-induced motion is representational rather than architectural; the 2D split is a non-identifiable inverse problem. OrthoMotion resolves it by mapping camera motion to a geometric channel consisting of a norm-preserving rotation of the RoPE phase and subject motion to a semantic channel consisting of gated value injection in cross-attention. These sub-operators are complementary, so a lightweight decoupling regularizer provably drives their response subspaces to orthogonality and thereby guarantees disentanglement by construction.
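To make the regularizer claim concrete, here is a minimal linear sketch of how a penalty of the form ‖AᵀB‖²_F (the same shape as the regularizer quoted in the Figure 1 caption) behaves under gradient descent. It assumes a fixed orthonormal basis `A` standing in for the camera response subspace and a free matrix `B` standing in for subject response directions; both are hypothetical stand-ins, and the paper's regularizer acts on Jacobians of a nonlinear network, which this toy does not model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 3

# Orthonormal basis for a stand-in "camera" response subspace (hypothetical).
A, _ = np.linalg.qr(rng.standard_normal((d, k)))
# Stand-in "subject" response directions, initially overlapping with span(A).
B = rng.standard_normal((d, k))

def overlap(A, B):
    """Squared Frobenius norm of A^T B, the toy decoupling penalty."""
    return float(np.sum((A.T @ B) ** 2))

# Part of B that is already orthogonal to span(A).
B_perp_initial = B - A @ (A.T @ B)
overlap_before = overlap(A, B)

lr = 0.1
for _ in range(200):
    grad = 2.0 * A @ (A.T @ B)   # gradient of ||A^T B||_F^2 with respect to B
    B = B - lr * grad

overlap_after = overlap(A, B)
B_perp_final = B - A @ (A.T @ B)

print(overlap_before, overlap_after)
```

In this linear setting each step shrinks only the component of `B` that lies inside span(`A`) and leaves the orthogonal component unchanged, so the penalty goes to zero without collapsing `B`. Whether the same holds through shared projections and softmax is exactly what the referee report asks the authors to show.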

What carries the argument

The OrthoMotion attention operator that splits camera motion into a norm-preserving rotation of rotary position embeddings and subject motion into gated value injection, with a regularizer enforcing orthogonality of the resulting subspaces.
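The two properties that make a rotary phase attractive as a camera channel are standard facts about RoPE-style rotations: they preserve vector norms, and the query-key score depends only on the phase difference. The sketch below checks both numerically. The extra `camera_phase` term is a hypothetical stand-in for the pose-dependent phase; the source does not say how the paper parameterizes it.

```python
import numpy as np

def rotate_pairs(x, angles):
    """Rotate consecutive coordinate pairs (x[2i], x[2i+1]) by angles[i]."""
    pairs = np.asarray(x, dtype=float).reshape(-1, 2)
    cos, sin = np.cos(angles), np.sin(angles)
    rotated = np.stack(
        [pairs[:, 0] * cos - pairs[:, 1] * sin,
         pairs[:, 0] * sin + pairs[:, 1] * cos],
        axis=1,
    )
    return rotated.reshape(-1)

rng = np.random.default_rng(0)
d = 8
q, k = rng.standard_normal(d), rng.standard_normal(d)

# RoPE-style per-pair frequencies: theta_i = 10000^(-2i/d).
freqs = 1.0 / (10000.0 ** (np.arange(d // 2) / (d // 2)))

pos_q, pos_k = 5.0, 2.0
camera_phase = 0.3 * np.ones(d // 2)  # hypothetical pose-dependent phase offset

q_rot = rotate_pairs(q, pos_q * freqs + camera_phase)
k_rot = rotate_pairs(k, pos_k * freqs)

# Property 1: the rotation leaves the token norm unchanged.
norm_preserved = np.isclose(np.linalg.norm(q_rot), np.linalg.norm(q))

# Property 2: the score depends only on the relative phase.
score = q_rot @ k_rot
score_relative = rotate_pairs(q, (pos_q - pos_k) * freqs + camera_phase) @ k

print(norm_preserved, np.isclose(score, score_relative))
```

Because the rotation changes only where attention looks and not the magnitude of the tokens, it is a natural candidate for the "rotation" half of the affine action the abstract describes.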

If this is right

Camera and subject controls reach state-of-the-art accuracy at the same time.
Cross-talk between the two controls drops by a factor of more than 2.4 on the new Cross-Talk Error (CTE) metric.
Disentanglement holds without loss of generation fidelity.
The separation generalizes across different backbone networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same complementary-operator pattern could be applied to separate other entangled factors such as lighting direction from object texture.
If the orthogonality is stable, the method might support real-time interactive video editing where camera and subject are adjusted independently.
The Cross-Talk Error metric itself could serve as a diagnostic for any motion-conditioned generator.

Load-bearing premise

That camera and subject motions produce optical flows that share the same inverse-depth scaling yet can still be routed into rotation versus gated translation to create subspaces a regularizer can make orthogonal.

What would settle it

A direct measurement of the inner product between the camera-channel and subject-channel response subspaces remaining above a small threshold after training, or a cross-talk error that does not fall when the regularizer weight is increased.
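One way such a measurement could be set up, assuming the two response subspaces are available as the column spans of two Jacobian-like matrices, is to orthonormalize each and inspect the principal-angle cosines between them. This is a generic linear-algebra sketch with hypothetical names (`J_cam`, `J_subj_*`), not the paper's evaluation protocol, and it assumes both matrices have full column rank.

```python
import numpy as np

def subspace_overlap(J_a, J_b):
    """Return (squared Frobenius overlap, cosines of the principal angles).

    Columns of J_a and J_b span the two subspaces being compared.
    """
    Q_a, _ = np.linalg.qr(J_a)   # orthonormal basis for span(J_a)
    Q_b, _ = np.linalg.qr(J_b)   # orthonormal basis for span(J_b)
    cosines = np.linalg.svd(Q_a.T @ Q_b, compute_uv=False)
    return float(np.sum(cosines ** 2)), cosines

eye = np.eye(6)
J_cam = eye[:, :2]                                   # spans e0, e1
J_subj_orth = eye[:, 2:4]                            # spans e2, e3
J_subj_same = J_cam @ np.array([[2.0, 1.0],
                                [0.0, 3.0]])         # same span as J_cam

overlap_orth, _ = subspace_overlap(J_cam, J_subj_orth)
overlap_same, _ = subspace_overlap(J_cam, J_subj_same)

print(overlap_orth, overlap_same)
```

The overlap is 0 for orthogonal subspaces and equals the subspace dimension when the spans coincide, so a post-training value well above 0 would count against the orthogonality claim.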

Figures

Figures reproduced from arXiv: 2606.22835 by Zijie Meng.

**Figure 1.** Overview of OrthoMotion. The geometric channel $\phi_g$ injects a norm-preserving phase $\Omega(g_t) \in SO(d_h)$ into RoPE, while the semantic channel $\phi_s$ fuses subject-trajectory tokens $Z_\tau$, gated by $M_t(x)$, into cross-attention values. A regularizer $\mathcal{L}_\perp = \lVert \hat{J}_g^\top \hat{J}_\tau \rVert_F^2$ enforces orthogonal camera/subject response subspaces.

**Figure 2.** Decoupling at a glance. Subject error (ObjMC, lower is better) vs. camera-motion magnitude; OrthoMotion stays nearly flat (Δ ≈ +2 px) while baselines entangle (Δ ≥ +25 px).

Original abstract

Controllable video generation demands independent command of the camera and the subject, yet 2D conditioning entangles them: camera- and object-induced optical flow share the same inverse-depth (1/Z) scaling and cannot be separated from image evidence alone. We first prove that this entanglement is representational, not architectural -- the 2D camera/object split is a non-identifiable inverse problem -- and therefore reframe decoupling as a question of operator design. We resolve it at the level of the attention operator. OrthoMotion routes camera motion into a geometric channel, a norm-preserving rotation of the rotary position embedding (RoPE) phase, and subject motion into a semantic channel, a gated value injection in cross-attention. Because these sub-operators are algebraically complementary -- a rotation versus a translation of the affine action on tokens -- a lightweight decoupling regularizer provably drives their response subspaces to orthogonality, so the two controls stop interfering. To our knowledge OrthoMotion is the first method to guarantee disentanglement by construction rather than hope for it to emerge. It attains state-of-the-art camera and subject accuracy at once while minimizing cross-talk, which we quantify with a new Cross-Talk Error (CTE) metric, cutting cross-talk by more than 2.4x with no loss in fidelity and generalizing across backbones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The operator split into RoPE rotation for camera and gated injection for subject motion is a concrete idea, but the claimed algebraic proof and 2.4x CTE gains stay unverified without the full equations and experiments.

The one thing to know is that OrthoMotion routes camera motion to a norm-preserving RoPE phase rotation and subject motion to gated value injection in cross-attention, then adds a lightweight regularizer to force the response subspaces orthogonal. The abstract frames this as the first guarantee-by-construction approach rather than an emergent property.

The reframing of the 2D entanglement as a representational non-identifiable inverse problem due to shared 1/Z scaling is clear and leads directly to the operator design. Introducing the Cross-Talk Error metric gives a concrete way to measure interference, which is useful even if the numbers need checking.

The soft spots are the missing pieces. The abstract asserts that the sub-operators are algebraically complementary and that the regularizer provably drives orthogonality, yet no derivation, equations, or controls appear. The full attention computation with shared Q/K and softmax could break the claimed complementarity, and the performance claims (SOTA accuracy plus 2.4x CTE cut with no fidelity loss) are stated without baselines or ablations here. If the regularizer only achieves approximate separation, the guarantee-by-construction claim does not hold.

This is for people working on controllable video generation who need independent camera and subject control. A reader already experimenting with attention modifications in generative models could extract the split idea and test it.

It deserves peer review so the full manuscript can be checked for the promised proof and reproducible results.

Referee Report

2 major / 2 minor

Summary. The paper claims that camera and subject motion entanglement in 2D video conditioning is representational due to shared 1/Z scaling, making it a non-identifiable inverse problem. It proposes OrthoMotion which routes camera motion to RoPE phase rotation and subject motion to gated value injection in cross-attention. These sub-operators are said to be algebraically complementary, allowing a lightweight regularizer to provably enforce orthogonality of response subspaces, guaranteeing disentanglement by construction. The method achieves SOTA accuracy with over 2.4x reduction in a new Cross-Talk Error (CTE) metric without loss in fidelity and generalizes across backbones.

Significance. If the claimed algebraic complementarity and provable orthogonality via the regularizer hold, this would be a notable advance in the field of controllable video generation, providing a principled operator-level solution to motion disentanglement instead of hoping for it to emerge from training. The introduction of the CTE metric for quantifying cross-talk could be useful for the community. The reported performance improvements suggest practical benefits for applications requiring independent control of camera and subject.

major comments (2)

Abstract: The abstract asserts a proof that the entanglement is representational, that the sub-operators are algebraically complementary, and that the regularizer provably drives orthogonality, yet supplies none of the derivation, equations, or experimental controls; these are load-bearing for the central claim of 'guarantee by construction'.
Method section on operator design: The assumption that routing camera motion exclusively to norm-preserving RoPE phase rotation while routing subject motion to gated value injection produces algebraically complementary operators (rotation vs. translation of the affine action on tokens) such that a regularizer forces response subspaces to orthogonality must be shown explicitly after accounting for the full attention computation including shared Q/K projections and softmax; without this the complementarity remains unverified.

minor comments (2)

Title: Missing space: 'OrthoMotion:Disentangling' should read 'OrthoMotion: Disentangling'.
Abstract: The CTE metric and 2.4x reduction claim would benefit from a brief definition or reference to its computation formula.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the need for clearer presentation of our theoretical claims. We address each major comment below. Where the manuscript requires additional explicit derivations or cross-references, we will revise accordingly.

Referee: Abstract: The abstract asserts a proof that the entanglement is representational, that the sub-operators are algebraically complementary, and that the regularizer provably drives orthogonality, yet supplies none of the derivation, equations, or experimental controls; these are load-bearing for the central claim of 'guarantee by construction'.

Authors: The abstract is intentionally concise and therefore omits the full derivations. The representational non-identifiability proof appears in Section 3.1 with the explicit 1/Z scaling argument and inverse-problem formulation. Algebraic complementarity between the RoPE rotation operator and the gated-value translation operator is derived in Section 3.2, and the regularizer's effect on response-subspace orthogonality (including the relevant inner-product bound) is shown in Section 3.3. Experimental controls that isolate the contribution of the regularizer are reported in Section 4.3 and the supplementary material. To improve accessibility we will add a single sentence to the abstract that points readers to these sections and will include a short proof sketch in the introduction.
Revision: yes

Referee: Method section on operator design: The assumption that routing camera motion exclusively to norm-preserving RoPE phase rotation while routing subject motion to gated value injection produces algebraically complementary operators (rotation vs. translation of the affine action on tokens) such that a regularizer forces response subspaces to orthogonality must be shown explicitly after accounting for the full attention computation including shared Q/K projections and softmax; without this the complementarity remains unverified.

Authors: We agree that an explicit end-to-end accounting is required. The current Section 3.2 derives complementarity at the level of the two sub-operators before the shared Q/K projections and softmax. We will revise this section to insert the full forward pass: (i) the effect of the RoPE phase rotation on the query-key dot products, (ii) the additive gated-value term after the value projection, and (iii) the subsequent softmax normalization. We will then show that the cross-term between the two channels remains zero under the regularizer even after these operations, thereby confirming that the subspaces stay orthogonal. This expanded derivation will be placed immediately after the current operator definitions.
Revision: yes
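The three steps of the forward pass named in this response can be sketched in a single toy attention layer. This is only an illustration under strong assumptions: both channels are placed in one layer, the camera phase is applied to queries only, and `gate` and `Z` are random stand-ins for the gating mask and subject-trajectory tokens. The paper may place the channels in different layers, and none of the names below come from it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rotate_pairs(x, angles):
    """Rotate consecutive coordinate pairs in each row of x by angles[row, pair]."""
    pairs = x.reshape(x.shape[0], -1, 2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.stack(
        [pairs[..., 0] * cos - pairs[..., 1] * sin,
         pairs[..., 0] * sin + pairs[..., 1] * cos],
        axis=-1,
    )
    return out.reshape(x.shape)

def toy_attention(Q, K, V, Z, gate, phase):
    """Toy layer: (i) phase-rotated scores, (ii) gated value term, (iii) softmax."""
    Q_rot = rotate_pairs(Q, phase)                       # camera channel stand-in
    A = softmax(Q_rot @ K.T / np.sqrt(Q.shape[1]))       # attention weights
    base = A @ V                                         # ungated value path
    injected = A @ (gate * Z)                            # subject channel stand-in
    return base + injected, injected, A

rng = np.random.default_rng(0)
n_q, n_k, d = 4, 5, 8
Q = rng.standard_normal((n_q, d))
K = rng.standard_normal((n_k, d))
V = rng.standard_normal((n_k, d))
Z = rng.standard_normal((n_k, d))
gate = rng.uniform(size=(n_k, 1))

phase0 = np.zeros((n_q, d // 2))
phase1 = 0.5 * np.ones((n_q, d // 2))

out0, inj0, A0 = toy_attention(Q, K, V, Z, gate, phase0)
out1, inj1, A1 = toy_attention(Q, K, V, Z, gate, phase1)

# For a fixed phase the subject term enters additively through the values.
additive = np.allclose(out0, A0 @ (V + gate * Z))
# Changing only the camera phase still changes the injected subject term,
# because the attention weights A multiply it.
injected_shift = float(np.abs(inj1 - inj0).max())

print(additive, injected_shift)
```

In this toy the subject contribution is additive for a fixed phase, but it is still multiplied by attention weights that depend on the camera phase. That coupling through softmax is why the cross-term has to be shown to vanish rather than assumed.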

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained architectural design

The paper asserts a representational non-identifiability proof for the 2D camera/subject split and then proposes an explicit operator split (RoPE rotation for camera, gated value injection for subject) whose algebraic complementarity is claimed to allow a regularizer to enforce orthogonality. No quoted equations, self-citations, or fitted parameters are shown reducing the 'guarantee by construction' to an input quantity or prior author result. The central claim rests on the design choice and internal proof rather than tautological re-use of outputs as inputs. This is the normal case of an independent method proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is abstract-only; the central claim rests on the unshown proof that the 2D split is non-identifiable and on the algebraic complementarity of rotation versus gated translation, neither of which can be inspected.

axioms (1)

domain assumption: Camera- and object-induced optical flow share the same inverse-depth (1/Z) scaling and cannot be separated from image evidence alone.
Stated as the starting premise in the abstract.

Reference graph

Works this paper leans on

31 extracted references · 7 linked inside Pith

[1] B. Wei, H. Liu, C. Qian, Z. Li, W. Wu, and Z. Meng, “Robust single image sand removal by leveraging uncertainty-aware sam priors and prompt learning with refined perceptual loss,” in Proceedings of the …, 2025.

[2] B. Wei, H. Liu, C. Qian, Z. Li, and Z. Meng, “Rusid: Robust uncertainty-aware single image deraining beyond certainty.”

[3] Y. Liu, H. Xiao, J. Chai, Y. Zhang, R. Wang, Z. Meng, and Z. Luo, “Synpo: Boosting training-free few-shot medical segmentation via high-quality negative prompts,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 594–603, Springer, 2025.

[4] Z. Meng, Y. Zeng, X. Chang, T. Xu, F. Chao, X. Cao, C. Shang, and Q. Shen, “Orpaint: a zero-shot inpainting model for oracle bone inscription rubbings with visual mamba block,” Science China Information Sciences, vol. 68, no. 8, p. 189102, 2025.

[5] Z. Meng, J. Che, B. Wei, and X. Cao, “Make a game: A novel paradigm for interactive game rendering,” in ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1026–1030, IEEE, 2026.

[6] Z. Meng, J. Liu, Y. Liu, C. Tong, X. Liu, Y. Zhang, Y. Xu, and P. Wan, “Argus: Stacked multi-view identity mosaic injection for subject-preserving video generation,” arXiv preprint arXiv:2606.11670, 2026.

[7] H. He, Y. Xu, Y. Guo, G. Wetzstein, B. Dai, H. Li, and C. Yang, “CameraCtrl: Enabling camera control for video generation,” in International Conference on Learning Representations (ICLR), 2025.

[8] C. Li, Y. Yang, J. Shao, H. Zhou, K. Schwarz, and Y. Liao, “Rerope: Repurposing rope for relative camera control,” arXiv preprint arXiv:2602.08068, 2026.

[9] R. Li, B. Yi, J. Liu, H. Gao, Y. Ma, and A. Kanazawa, “Cameras as relative positional encoding,” Advances in Neural Information Processing Systems, vol. 38, pp. 15984–16009, 2026.

[10] S. Yin, C. Wu, J. Liang, J. Shi, H. Li, G. Ming, and N. Duan, “Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory,” arXiv preprint arXiv:2308.08089, 2023.

[11] Z. Zhang, J. Liao, M. Li, Z. Dai, B. Qiu, S. Zhu, L. Qin, and W. Wang, “Tora: Trajectory-oriented diffusion transformer for video generation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 2063–2073, 2025.

[12] Z. Meng, Z. Li, Y. Liu, Z. Li, J. Liu, W. Nie, B. Wei, and M. Zhang, “Trident: Breaking the hybrid-safety-physics coupling for provably safe multi-agent reinforcement learning,” 2026.

[13] Z. Wang, Z. Yuan, X. Wang, Y. Li, T. Chen, M. Xia, P. Luo, and Y. Shan, “Motionctrl: A unified and flexible motion controller for video generation,” in ACM SIGGRAPH 2024 Conference Papers, pp. 1–11, 2024.

[14] Z. Zhang, F. Long, Z. Qiu, Y. Pan, W. Liu, T. Yao, and T. Mei, “Motionpro: A precise motion controller for image-to-video generation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 27957–27967, 2025.

[15] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025.

[16] Z. Meng, “Parascale: Scale-calibrated camera-motion transfer via a gauge-invariant parallax number,” 2026.

[17] H. He, C. Yang, S. Lin, Y. Xu, et al., “CameraCtrl II: Dynamic scene exploration via camera-controlled video diffusion models,” arXiv preprint arXiv:2503.10592, 2025.

[18] Q. Wang, Y. Luo, X. Shi, X. Jia, H. Lu, T. Xue, X. Wang, P. Wan, D. Zhang, and K. Gai, “Cinemaster: A 3d-aware and controllable framework for cinematic text-to-video generation,” in Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pp. 1–10, 2025.

[19] T. Hu, J. Zhang, R. Yi, Y. Wang, H. Huang, J. Weng, Y. Wang, and L. Ma, “Motionmaster: Training-free camera motion transfer for video generation,” arXiv preprint arXiv:2404.15789, 2024.

[20] M. You, Z. Zhu, H. Liu, and J. Hou, “Nvs-solver: Video diffusion model as zero-shot novel view synthesizer,” arXiv preprint arXiv:2405.15364, 2024.

[21] J. Liu, S. Li, Z. Fang, X. Li, Y. Zhou, Z. Meng, Z. Zhang, Y. Luo, G. Zhang, Y.-S. Liu, et al., “Omnidirector: General multi-shot camera cloning without cross-paired data,” arXiv preprint arXiv:2606.13432, 2026.

[22] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd ed., 2004.

[23] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4195–4205, 2023.

[24] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” arXiv preprint arXiv:2104.09864, 2021.

[25] X. Wang, H. Yuan, S. Zhang, D. Chen, J. Wang, Y. Zhang, Y. Shen, D. Zhao, and J. Zhou, “Videocomposer: Compositional video synthesis with motion controllability,” Advances in Neural Information Processing Systems, vol. 36, pp. 7594–7611, 2023.

[26] Q. Li, Z. Xing, R. Wang, H. Zhang, Q. Dai, and Z. Wu, “Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance,” arXiv preprint arXiv:2503.16421, 2025.

[27] M. Geyer, O. Bar-Tal, S. Bagon, and T. Dekel, “Tokenflow: Consistent diffusion features for consistent video editing,” arXiv preprint arXiv:2307.10373, 2023.

[28] J. Jiang, G. Hong, L. Zhou, E. Ma, H. Hu, X. Zhou, J. Xiang, F. Liu, K. Yu, H. Sun, et al., “Dive: Dit-based video generation with enhanced control,” arXiv preprint arXiv:2409.01595, 2024.

[29] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in International Conference on Learning Representations (ICLR), 2023.

[30] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint arXiv:1812.01717, 2018.

[31] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML), pp. 8748–8763, 2021.
