
arxiv: 2606.19805 · v1 · pith:RYC2O3F3new · submitted 2026-06-18 · 💻 cs.CV · cs.AI

ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

Pith reviewed 2026-06-26 18:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords camera motion transfer · parallax number · scale calibration · gauge invariance · video generation · pose-conditioned models · monocular depth · trajectory transfer

The pith

Scale-faithful camera-motion transfer requires preserving a dimensionless Parallax Number rather than the raw trajectory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that monocular camera trajectories are only meaningful up to an unknown depth scale because translation-induced image motion scales with ||T||/Z. It defines the Parallax Number Pi as ||Delta T|| divided by average depth Zbar, a single scalar that stays constant when both the trajectory and the scene are scaled by the same factor. ParaScale extracts this number from any reference video and re-realizes it against the target scene's own depth map on a per-frame basis while leaving rotation unchanged. The approach sits between pose extraction and pose injection, requires no retraining, and works with any pose-conditioned video generator. It also introduces the Parallax Consistency Error metric that directly measures scale mismatch, unlike similarity-aligned trajectory errors.
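The calibration step described above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the function names are hypothetical, and the per-frame gain `alpha = Zbar_tgt / Zbar_ref` is an assumed form that follows from requiring the target to realize the same Pi. Rotation is not handled because the method passes it through unchanged.

```python
import math

def parallax_number(delta_t, mean_depth):
    """Pi = ||Delta T|| / Zbar for one frame step (dimensionless)."""
    return math.sqrt(sum(c * c for c in delta_t)) / mean_depth

def calibrate_translations(ref_delta_ts, ref_mean_depths, tgt_mean_depths):
    """Rescale each reference translation step so the target realizes the same Pi.

    Assumption: the per-frame gain is alpha_t = Zbar_tgt_t / Zbar_ref_t.
    Rotation is copied unchanged and therefore does not appear here.
    """
    calibrated = []
    for delta_t, z_ref, z_tgt in zip(ref_delta_ts, ref_mean_depths, tgt_mean_depths):
        alpha = z_tgt / z_ref
        calibrated.append(tuple(alpha * c for c in delta_t))
    return calibrated

# Hypothetical numbers: a desk-scale reference and a target that is 1000x deeper.
ref_steps = [(0.01, 0.0, 0.0), (0.0, 0.02, 0.0)]
ref_depths = [0.5, 0.4]
tgt_depths = [500.0, 400.0]

tgt_steps = calibrate_translations(ref_steps, ref_depths, tgt_depths)
ref_pi = [parallax_number(d, z) for d, z in zip(ref_steps, ref_depths)]
tgt_pi = [parallax_number(d, z) for d, z in zip(tgt_steps, tgt_depths)]
```

With these inputs the target translation steps grow by the depth ratio while `tgt_pi` matches `ref_pi` frame by frame, which is the property the method is meant to preserve.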

Core claim

The central claim is that scale-faithful transfer must preserve the gauge-invariant Parallax Number Pi = ||Delta T|| / Zbar per frame instead of the raw trajectory, because this single dimensionless quantity encodes how strongly the camera move is felt in the image plane.

What carries the argument

The Parallax Number Pi, a dimensionless ratio of translation magnitude to average scene depth that remains invariant under uniform scaling of the entire scene and trajectory.
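The invariance can be checked in one line. The scale factor s is a symbol introduced here for illustration; it multiplies both the translation and every scene depth.

```latex
\Pi(s\,\Delta T,\; s\,\bar{Z})
  = \frac{\lVert s\,\Delta T \rVert}{s\,\bar{Z}}
  = \frac{s\,\lVert \Delta T \rVert}{s\,\bar{Z}}
  = \frac{\lVert \Delta T \rVert}{\bar{Z}}
  = \Pi(\Delta T,\; \bar{Z}), \qquad s > 0 .
```

Scaling only the scene, by contrast, gives \Pi(\Delta T, s\bar{Z}) = \Pi(\Delta T, \bar{Z}) / s, which is the mismatch that arises when a raw trajectory is reused in a scene of different size.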

If this is right

  • Transfer remains consistent when reference and target scenes differ in linear scale by four orders of magnitude.
  • The same reference trajectory can be applied to both microscopic and astronomical scenes without manual rescaling.
  • Pose-conditioned generators can accept the calibrated poses directly, with no change to their training or architecture.
  • The Parallax Consistency Error metric reveals scale mismatch that similarity-aligned trajectory error hides.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gauge-invariant scalar could be used to calibrate motion transfer in non-video domains such as 3D scene editing or robotics trajectory replay.
  • Because rotation is left untouched, the method implicitly relies on rotation-induced image motion being independent of scene depth, so that only the translational component needs scale calibration.
  • If depth maps in the target scene contain large errors, the realized Pi will deviate even when the method runs correctly, suggesting a natural extension that jointly refines depth and motion.

Load-bearing premise

Image motion caused by translation is exactly proportional to translation magnitude over depth, and matching the single scalar Pi per frame is enough to achieve faithful transfer while rotation can be copied unchanged.
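A short pinhole-camera derivation shows where the premise is exact and where it is a first-order statement. The focal length f and the point coordinates (X, Y, Z) are notation assumed here for illustration, not taken from the paper.

```latex
% Pinhole projection of a point (X, Y, Z) in the camera frame, focal length f:
x = f\,\frac{X}{Z}

% Lateral camera translation \Delta T_x (parallel to the image plane):
x' = f\,\frac{X - \Delta T_x}{Z}
\quad\Longrightarrow\quad
\Delta x = x' - x = -\,f\,\frac{\Delta T_x}{Z}

% Forward camera translation \Delta T_z (along the optical axis):
x' = f\,\frac{X}{Z - \Delta T_z}
\quad\Longrightarrow\quad
\Delta x = x\,\frac{\Delta T_z}{Z - \Delta T_z} \approx x\,\frac{\Delta T_z}{Z}
\quad (\lvert \Delta T_z \rvert \ll Z)
```

For lateral motion the displacement is exactly proportional to the translation over depth. For forward motion the proportionality holds only to first order and is weighted by image position, so a single scalar per frame is an approximation in that case.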

What would settle it

Generate videos from the same reference motion but with target scenes whose average depth differs by a known factor; if the realized image motion strength does not match the reference after ParaScale adjustment, the claim that preserving Pi suffices is false.
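The bookkeeping behind this test can be sketched numerically. The sketch below only checks the arithmetic; the actual experiment would require generated videos and measured image motion. The log-ratio error is an assumed stand-in chosen because it is scale-symmetric; it is not the paper's PCE definition, and all names and numbers are hypothetical.

```python
import math

def realized_parallax(step_norm, mean_depth):
    """Parallax realized by a translation step of given norm at a given mean depth."""
    return step_norm / mean_depth

def log_ratio_error(realized, intended):
    """Assumed scale-symmetric mismatch: |log(realized / intended)|.

    Illustrative stand-in only, not the paper's PCE.
    """
    return abs(math.log(realized / intended))

ref_step_norm, ref_depth = 0.02, 1.0
intended = realized_parallax(ref_step_norm, ref_depth)

results = {}
for k in (1e-2, 1.0, 1e2):  # known depth factor between target and reference
    tgt_depth = k * ref_depth
    # Raw transfer: the reference step is reused as-is.
    raw = realized_parallax(ref_step_norm, tgt_depth)
    # Calibrated transfer: the step is rescaled by the depth ratio.
    calibrated = realized_parallax(ref_step_norm * k, tgt_depth)
    results[k] = (log_ratio_error(raw, intended),
                  log_ratio_error(calibrated, intended))
```

Under this stand-in metric, raw transfer is penalized equally for a scene 100 times shallower and 100 times deeper, while the calibrated error stays at zero up to floating-point rounding.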

Figures

Figures reproduced from arXiv: 2606.19805 by Zijie Meng.

Figure 1
Figure 1: ParaScale. From the reference we read the gauge-invariant Parallax Number Π_t (Eq. 2); the target scene supplies its own depth Z̄_t^tgt; a per-frame translational gain α_t re-realizes Π_t in the target while rotation passes through unchanged (Eq. 5). The module is a drop-in pre-processor before a frozen pose-conditioned generator. The dashed path is the uncalibrated baseline that ships raw translation and bre…
Figure 2
Figure 2: Left: realized vs. intended Parallax Number (log–log); ParaScale tracks the identity line while raw transfer under-drives cosmic and over-drives tabletop scenes. Right: PCE per scale regime; ParaScale stays low and flat.
… and “cosmic” astro/aerial), and transfer them to generations from a Plücker-conditioned controller [3] on a Wan2.1 backbone [20]. All methods share this backbone; only the trajectory fe…
Original abstract

Transferring the camera motion of a reference video to a freshly generated one lets creators reuse cinematic moves. Yet reference and target often live at incompatible scales -- a sweep across a galaxy versus a nudge across a desk -- and naively reusing the recovered trajectory yields either imperceptible or violently exaggerated motion. We trace this to a geometric fact: translation-induced image motion scales as ||T||/Z, so a monocular trajectory is meaningful only up to a depth-scale gauge. We distill this into the Parallax Number Pi = ||Delta T|| / Zbar, a dimensionless, gauge-invariant descriptor of how strongly a camera move is felt, and prove that it -- not the raw trajectory -- is the quantity that scale-faithful transfer must preserve. ParaScale is a plug-and-play module that reads Pi off any reference video and re-realizes it against the target scene's own depth, per frame, leaving rotation untouched. Sitting between pose extraction and pose injection, it requires no retraining and drops into any pose-conditioned generator. We further introduce the Parallax Consistency Error (PCE), a scale-symmetric metric that -- unlike the similarity-aligned TransErr -- exposes scene-scale mismatch. Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts PCE by more than 3x over uncalibrated transfer with no loss of visual fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that monocular camera trajectories are only meaningful up to a depth-scale gauge because translation-induced image motion scales as ||T||/Z. It distills this into the Parallax Number Pi = ||ΔT|| / Z̄ (Z̄ = arithmetic mean depth), asserts a proof that scale-faithful transfer must preserve this gauge-invariant scalar (not the raw trajectory) while leaving rotation untouched, and presents ParaScale as a plug-and-play module that reads Pi from a reference video and re-realizes it against the target scene's depth per frame. It also introduces the scale-symmetric Parallax Consistency Error (PCE) metric and reports that ParaScale keeps realized parallax on the identity line across four orders of magnitude in scale, reduces PCE by >3× versus uncalibrated transfer, and incurs no visual-fidelity loss on multiple backbones.

Significance. If the derivation and definition of Pi are corrected, the work supplies a lightweight, training-free calibration layer that sits between any pose estimator and any pose-conditioned generator, addressing a practical mismatch that currently forces either imperceptible or exaggerated motion when reference and target scenes differ in scale. The PCE metric is a clear incremental strength because, unlike similarity-aligned TransErr, it is invariant to global scale and directly exposes the mismatch the method targets. No machine-checked proofs or parameter-free derivations are present.

major comments (1)
  1. [Definition of the Parallax Number / geometric premise] Definition of Pi (abstract and geometric-premise section): Pi is defined as ||ΔT|| / Z̄ with Z̄ the arithmetic mean of scene depths. Translation-induced image motion at pixel i is ||ΔT|| / Z_i, so the quantity that should be preserved to keep average parallax magnitude is ||ΔT|| · E[1/Z_i]. By Jensen's inequality, E[1/Z] ≥ 1/E[Z] with equality only when depth variance is zero. Consequently the stated Pi equals the correct average parallax only in the degenerate constant-depth case; in any scene with depth variation the preserved quantity systematically under-states the felt motion. This directly contradicts the central claim that Pi is the precise gauge-invariant descriptor whose preservation guarantees scale-faithful transfer.
minor comments (1)
  1. [Abstract / introduction] The abstract states that a proof is given that Pi (rather than the raw trajectory) must be preserved, yet no derivation steps, error bounds, or verification that per-frame re-realization maintains visual fidelity are visible. A short dedicated subsection laying out the steps would improve readability.
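A two-depth example makes the size of the Jensen gap in the major comment concrete. The depth values are chosen here purely for illustration, with half the pixels at Z = 1 and half at Z = 9.

```latex
\mathbb{E}[Z] = \frac{1 + 9}{2} = 5
\quad\Longrightarrow\quad
\frac{1}{\mathbb{E}[Z]} = 0.2

\mathbb{E}\!\left[\frac{1}{Z}\right] = \frac{1}{2}\left(1 + \frac{1}{9}\right) = \frac{5}{9} \approx 0.556

\frac{\lVert \Delta T \rVert \, \mathbb{E}[1/Z]}{\lVert \Delta T \rVert / \mathbb{E}[Z]}
  = \frac{5/9}{1/5} = \frac{25}{9} \approx 2.78
```

In this toy scene the arithmetic-mean definition understates the average parallax by a factor of about 2.78. The gap shrinks toward 1 as the depth spread narrows.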

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive comment on the definition of the Parallax Number. The observation regarding Jensen's inequality is valid and we address it directly below, indicating the changes we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: Definition of Pi (abstract and geometric-premise section): Pi is defined as ||ΔT|| / Z̄ with Z̄ the arithmetic mean of scene depths. Translation-induced image motion at pixel i is ||ΔT|| / Z_i, so the quantity that should be preserved to keep average parallax magnitude is ||ΔT|| · E[1/Z_i]. By Jensen's inequality, E[1/Z] ≥ 1/E[Z] with equality only when depth variance is zero. Consequently the stated Pi equals the correct average parallax only in the degenerate constant-depth case; in any scene with depth variation the preserved quantity systematically under-states the felt motion. This directly contradicts the central claim that Pi is the precise gauge-invariant descriptor whose preservation guarantees scale-faithful transfer.

    Authors: We agree that the average parallax magnitude is strictly ||ΔT|| ⋅ E[1/Z_i] and that the arithmetic-mean formulation equals this quantity only when depth variance is zero. The original definition was selected for its simplicity in yielding a dimensionless, gauge-invariant scalar that is straightforward to compute from a reference video. Nevertheless, the referee's point is correct and directly impacts the precision of the central claim. In the revised manuscript we will redefine Pi as ||ΔT|| ⋅ E[1/Z_i] (equivalently, ||ΔT|| divided by the harmonic mean of the depths). We will update the abstract, the geometric-premise section, and the accompanying derivation to reflect this more accurate quantity while preserving the overall transfer procedure and the PCE metric. This change does not affect the empirical results or the plug-and-play nature of ParaScale.

    revision: yes
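The equivalence the authors invoke, ||ΔT|| ⋅ E[1/Z] = ||ΔT|| / (harmonic mean of Z), can be checked with the Python standard library. The depth values and step norm below are hypothetical; this is a sketch of the redefinition, not code from the paper.

```python
import math
from statistics import fmean, harmonic_mean

# Hypothetical per-pixel depths and translation-step norm.
depths = [1.0, 9.0]
step_norm = 0.1

# Original definition: divide by the arithmetic mean depth.
pi_arithmetic = step_norm / fmean(depths)

# Proposed redefinition: multiply by the mean inverse depth ...
pi_inverse_depth = step_norm * fmean([1.0 / z for z in depths])

# ... which equals dividing by the harmonic mean depth.
pi_harmonic = step_norm / harmonic_mean(depths)
```

Both forms of the redefinition remain gauge-invariant, since the harmonic mean scales linearly with the depths just as the arithmetic mean does.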

Circularity Check

0 steps flagged

No circularity; Parallax Number defined directly from geometric scaling relation

Full rationale

The paper defines Pi = ||Delta T|| / Zbar explicitly from the stated geometric premise that translation-induced image motion scales as ||T||/Z, then claims this scalar is the gauge-invariant quantity that must be preserved for scale-faithful transfer. No equations, self-citations, or fitted parameters are visible that would reduce the claimed result to the inputs by construction. The derivation chain remains self-contained against external geometric benchmarks, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the geometric scaling relation ||T||/Z and the assertion that a single scalar Pi suffices to control perceived motion strength; no free parameters or invented entities beyond the defined Pi are mentioned.

axioms (1)
  • [domain assumption] Translation-induced image motion scales as ||T||/Z
    Invoked in the abstract as the geometric fact that makes monocular trajectories scale-dependent.
invented entities (1)
  • Parallax Number Pi [no independent evidence]
    purpose: Gauge-invariant descriptor of camera-move strength
    Defined as ||Delta T|| / Zbar; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5779 in / 1370 out tokens · 22288 ms · 2026-06-26T18:32:35.654913+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 5 linked inside Pith

  [1] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.

  [2] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2004.

  [3] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for video generation. In International Conference on Learning Representations (ICLR), 2025.

  [4] Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, et al. CameraCtrl II: Dynamic scene exploration

    Variant                      PCE↓   TransErr↓   RotErr↓
    global α (no per-frame)      0.47   0.83        1.03
    also rescale rotation        0.31   0.51        1.41
    w/o depth (norm-only Π)      0.44   0.74        1.04
    ★ ParaScale (full)           0.28   0.46        1.02

    Table 3: Ablation. Per-frame α_t (Prop. 2) and a depth estimate are both needed; rescaling ...

  [5] Teng Hu, Jiangning Zhang, Ran Yi, Yating Wang, Hongrui Huang, Jieyu Weng, Yabiao Wang, and Lizhuang Ma. Motionmaster: Training-free camera motion transfer for video generation. arXiv preprint arXiv:2404.15789, 2024.

  [6] Junpeng Jiang, Gangyi Hong, Lijun Zhou, Enhui Ma, Hengtong Hu, Xia Zhou, Jie Xiang, Fan Liu, Kaicheng Yu, Haiyang Sun, et al. Dive: Dit-based video generation with enhanced control. arXiv preprint arXiv:2409.01595, 2024.

  [7] Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5820–5829, 2021.

  [8] Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. arXiv preprint arXiv:2503.16421, 2025.

  [9] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  [10] Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, et al. Omnidirector: General multi-shot camera cloning without cross-paired data. arXiv preprint arXiv:2606.13432, 2026.

  [11] Yufei Liu, Haoke Xiao, Jiaxing Chai, Yongcun Zhang, Rong Wang, Zijie Meng, and Zhiming Luo. Synpo: Boosting training-free few-shot medical segmentation via high-quality negative prompts. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 594–

  [12] Zhijian Liu, Haotian Tang, Alexander Amini, Xingyu Yang, Huizi Mao, Daniela Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In IEEE International Conference on Robotics and Automation (ICRA), 2023.

  [13] Zijie Meng, Jinming Che, Bingcai Wei, and Xixin Cao. Make a game: A novel paradigm for interactive game rendering. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1026–1030. IEEE, 2026.

  [14] Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, and Miao Zhang. Trident: Breaking the hybrid-safety-physics coupling for provably safe multi-agent reinforcement learning, 2026.

  [15] Zijie Meng, Jiwen Liu, Yufei Liu, Chengzhuo Tong, Xiaoqiang Liu, Yuanxing Zhang, Yulong Xu, and Pengfei Wan. Argus: Stacked multi-view identity mosaic injection for subject-preserving video generation. arXiv preprint arXiv:2606.11670, 2026.

  [16] Zijie Meng, Bingcai Wei, Shuqin Chen, Jinming Che, Xinyan Cao, and JinLong Lin. Omnidrive: Towards unified next-gen controllable multi-view driving video generation with llm-guided world model.

  [17] Zijie Meng, Yuanze Zeng, Xiang Chang, Tianshuo Xu, Fei Chao, Xixin Cao, Changjing Shang, and Qiang Shen. Orpaint: a zero-shot inpainting model for oracle bone inscription rubbings with visual mamba block. Science China Information Sciences, 68(8):189102, 2025.

  [18] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

  [19] Shuhan Tan, Boris Ivanovic, Xinshuo Weng, Marco Pavone, and Philipp Kraehenbuehl. Language conditioned traffic generation. arXiv preprint arXiv:2307.07947, 2023.

  [20] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  [21] Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, and Kun Gai. Cinemaster: A 3d-aware and controllable framework for cinematic text-to-video generation. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–10, 2025.

  [22] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36:7594–7611, 2023.

  [23] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

  [24] Bingcai Wei, Hui Liu, Chuang Qian, Zijian Li, and Zijie Meng. Rusid: Robust uncertainty-aware single image deraining beyond certainty.

  [25] Bingcai Wei, Hui Liu, Chuang Qian, Zijian Li, Wangyu Wu, and Zijie Meng. Robust single image sand removal by leveraging uncertainty-aware sam priors and prompt learning with refined perceptual loss. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 4932–4941, 2025.

  [26] Wei Wu, Xi Guo, Weixuan Tang, Tingxuan Huang, Chiyu Wang, Dongyue Chen, and Chenjing Ding. Drivescape: Towards high-resolution controllable multi-view driving video generation. arXiv preprint arXiv:2409.05463, 2024.

  [27] Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, and Tong Zhang. Generating multimodal driving scenes via next-scene prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6844–6853, 2025.

  [28] Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. Nvs-solver: Video diffusion model as zero-shot novel view synthesizer. arXiv preprint arXiv:2405.15364, 2024.