SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

Hyelin Nam; Jeong Joon Park; Supasorn Suwajanakorn; Suttisak Wizadwongsa

arxiv: 2606.17310 · v1 · pith:VGZXGSTWnew · submitted 2026-06-15 · 💻 cs.CV

Suttisak Wizadwongsa, Hyelin Nam, Supasorn Suwajanakorn, Jeong Joon Park

Pith reviewed 2026-06-27 03:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords video retaking, camera controllability, Sierpinski triangle, texture cues, video diffusion, viewpoint changes, RoPE indices, geometric consistency

The pith

SierpinskiCam augments geometry guidance with Sierpinski dome texture cues to maintain camera control during large viewpoint shifts in video retaking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video retaking generates novel views along a user-specified camera trajectory from one monocular video. Geometry-only guidance from reconstructed 4D representations weakens or vanishes when the target camera departs far from the source path, leaving new scene regions unsupported. SierpinskiCam adds texture cues rendered from a Sierpinski dome pattern that supply dense trackable features even across extreme viewpoint changes. It further conditions the diffusion model on the source video by appending its tokens to the target sequence and isolating the streams with negative RoPE indices, achieving appearance grounding with no model changes or per-video fine-tuning. Experiments report gains in camera controllability, geometric consistency, and final video quality across varied retaking cases.

Core claim

SierpinskiCam augments geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. It introduces a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios.

What carries the argument

Sierpinski dome texture cues that supply persistent trackable features, combined with negative-RoPE separation of source and target video token streams for reference conditioning.
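
The stream separation described above can be sketched by giving target tokens non-negative RoPE positions and source tokens negative ones, then applying a rotary embedding. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names `rope_rotate` and `stream_positions`, the one-dimensional layout, the frequency base of 10000, and the specific indices (-1, -2, ...) are all hypothetical, because the text here does not say which RoPE axes receive negative indices or how they are ordered.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply a 1D rotary position embedding to x of shape (n_tokens, dim).

    Assumption: "rotate-half" pairing of channels and a single position axis.
    Video models usually use multi-axis RoPE; this sketch keeps one axis only.
    """
    n_tokens, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequency
    angles = positions[:, None] * freqs[None, :]     # (n_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def stream_positions(n_target, n_source):
    """Hypothetical index layout: target tokens first, source tokens appended."""
    target_pos = np.arange(n_target)                 # 0, 1, ..., n_target - 1
    source_pos = -np.arange(1, n_source + 1)         # -1, -2, ..., -n_source
    return np.concatenate([target_pos, source_pos])

# Toy usage: 4 target tokens followed by 3 appended source tokens.
positions = stream_positions(4, 3)
rng = np.random.default_rng(0)
queries = rng.normal(size=(7, 8))
keys = rng.normal(size=(7, 8))
scores = rope_rotate(queries, positions) @ rope_rotate(keys, positions).T
```

Because RoPE attention scores depend only on position differences, the two streams can occupy disjoint index ranges within one attention pass, which is consistent with the claim that no architectural change is required.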

If this is right

Target camera paths can deviate substantially farther from the source trajectory while retaining usable guidance.
Newly revealed scene regions stay geometrically consistent because the added cues remain trackable.
Appearance grounding succeeds without any per-video adaptation or architecture changes to the underlying diffusion model.
The same gains appear across diverse and challenging retaking scenarios without task-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fractal-pattern cue idea could be tested in other diffusion pipelines that need viewpoint-robust conditioning, such as novel-view video synthesis.
Negative-RoPE stream separation might reduce interference in any token-based reference conditioning setup inside diffusion transformers.
Replacing the fixed Sierpinski pattern with an optimized or scene-adaptive texture could further increase feature density in low-texture regions.

Load-bearing premise

Sierpinski dome texture cues will reliably contain rich trackable features even under large viewpoint changes.
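
The premise rests on the pattern's self-similarity, which a small grid construction makes concrete. The sketch below uses the standard bitwise construction of a right-angled Sierpinski triangle on a square grid and checks that subsampling by two reproduces the pattern one level down. The function name `sierpinski_mask` is hypothetical, and the paper's actual texture and its mapping onto the dome are not specified in this text, so this illustrates the fractal property only.

```python
import numpy as np

def sierpinski_mask(levels):
    """Return a boolean (2**levels, 2**levels) grid of a Sierpinski triangle.

    A cell (x, y) is filled when x and y share no set bit, i.e. (x & y) == 0.
    This is a common discrete construction, assumed here for illustration.
    """
    size = 1 << levels
    ys, xs = np.mgrid[0:size, 0:size]
    return (xs & ys) == 0

fine = sierpinski_mask(5)      # 32 x 32 grid
coarse = sierpinski_mask(4)    # 16 x 16 grid

# Keeping every second row and column of the fine grid yields the coarse grid,
# a discrete analogue of viewing the same pattern from farther away.
same_when_subsampled = np.array_equal(fine[::2, ::2], coarse)

# The corner quadrant of the fine grid is also a copy of the coarse grid,
# a discrete analogue of zooming in on part of the pattern.
same_when_cropped = np.array_equal(fine[:16, :16], coarse)
```

Both checks hold at every level of this construction, which is the sense in which such a pattern keeps structure in both near and far views.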

What would settle it

Run the method on a source video whose target trajectory produces large viewpoint changes that render the projected Sierpinski pattern features untrackable; if retaking quality then matches or falls below the geometry-only baseline, the central claim is false.
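
One way to make "untrackable" measurable is to count geometrically verified matches between frame pairs, in the spirit of the RANSAC-verified inlier counts reported in Figure 5. The sketch below is a simplified stand-in rather than the paper's protocol: it assumes keypoints are already matched (no SIFT step), replaces the geometric model with a pure 2D translation, and uses a hypothetical function name, threshold, and iteration count.

```python
import numpy as np

def ransac_translation_inliers(pts_a, pts_b, threshold=2.0, iterations=200, seed=0):
    """Count matches consistent with one 2D translation, using RANSAC.

    pts_a, pts_b: (n, 2) arrays of matched keypoint coordinates in two frames.
    Assumption: a translation-only model; real pipelines typically fit a
    homography or fundamental matrix instead.
    """
    rng = np.random.default_rng(seed)
    best_count = 0
    for _ in range(iterations):
        i = rng.integers(len(pts_a))                 # minimal sample: one match
        shift = pts_b[i] - pts_a[i]                  # hypothesised translation
        residuals = np.linalg.norm(pts_a + shift - pts_b, axis=1)
        best_count = max(best_count, int((residuals < threshold).sum()))
    return best_count

# Synthetic frame pair: 80 matches follow a shared shift, 20 are mismatches.
rng = np.random.default_rng(1)
frame_a = rng.uniform(0, 500, size=(100, 2))
frame_b = frame_a + np.array([12.0, -7.0])
frame_b[80:] = rng.uniform(0, 500, size=(20, 2))
inlier_count = ransac_translation_inliers(frame_a, frame_b)
```

Under this proxy, a trajectory that drives the verified inlier count toward zero on the rendered pattern would be a candidate for the falsification test described above.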

Figures

Figures reproduced from arXiv: 2606.17310 by Hyelin Nam, Jeong Joon Park, Supasorn Suwajanakorn, Suttisak Wizadwongsa.

**Figure 1.** SierpinskiCam for video retake generation. Given a source video and a target camera trajectory (blue → red), SierpinskiCam retakes the video under user-defined camera motions. Even under large viewpoint changes with sparse source evidence, our Sierpinski textured dome (top condition video) and negative rotary position embedding allow faithful following of the target camera trajectory while preserving the o…

**Figure 2.** Overview of SierpinskiCam. Our model processes two parallel streams: (1) a target stream (top) where noisy target latents zt are concatenated with the Sierpinski-dome camera-controlling video c, using positive RoPE indices for spatial alignment; and (2) a source stream (bottom) where noised and clean source latents are concatenated using negative RoPE indices. This negative indexing isolates the source con…

**Figure 3.** Multi-scale robustness. Unlike the Checkerboard pattern (a), the Sierpinski fractal pattern (b) provides structural details in both near and far views, ensuring camera pose control across scales. Sierpinski textured dome. This motivates an additional camera-motion cue that remains informative beyond the valid coverage of the source-derived geometry. We texture the dome with a Sierpinski fractal triangle …

**Figure 4.** Qualitative comparison on the DAVIS dataset. For each scene, top row shows the source video and target trajectory, where the camera moves from blue → red. Note how prior methods incorrectly put main characters (human/swan) at the center of the frames or create spurious artifacts, while SierpinskiCam accurately follows the user-defined camera paths. Left: the person should go down since the camera is fixed.…

**Figure 5.** (a) Pattern-only trackability over 10 camera trajectories. Ours (Sierpinski) yields the most RANSAC-verified SIFT inliers per frame pair. (b) Representative frame-t to frame-t+5 matches; for clarity, only the top 10% inliers by Lowe ratio are drawn, while titles report total inliers.

**Figure 6.** Failure cases of TrajectoryCrafter. When the target camera moves beyond the original scene coverage, it may hallucinate content or fail to follow the camera pose, especially as the main object becomes small or leaves the target-view frustum. We observed that TrajectoryCrafter [29] often fails when the target camera trajectory extends far beyond the original scene coverage. As shown in …

**Figure 7.** Geometry proxy design and conditioning video examples. (a) The Sierpinski triangle pattern used as a geometry proxy for camera motion conditioning. Its self-similar structure provides rich spatial cues without scene-specific texture, enabling generalizable control. (b) Example conditioning videos under two representative depth regimes: far-field (distant background) and near-field (close background), illu…

**Figure 8.** Additional camera trajectories used for evaluation. Each row shows the geometry proxy frames used as video conditioning input for four additional camera trajectories. The sampled frames illustrate the spatial extent and viewpoint variation induced by each trajectory. Although DA3 represents one of the latest advances in dynamic camera motion estimation, we observe that its estimated camera paths still cont…

**Figure 9.** Screenshot of the user study inter…

**Figure 10.** Additional comparison on newly added challenging camera trajectories.

**Figure 11.** Additional qualitative comparison on the DAVIS dataset.

**Figure 12.** Additional qualitative comparison on generated video by Veo.

**Figure 13.** Additional qualitative comparison on generated video by Veo.

Original abstract

Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contains rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sierpinski patterns as viewpoint-robust cues plus negative-RoPE conditioning form the concrete new pieces, but the abstract gives no numbers to check if they actually deliver the claimed gains.

The paper's main addition is pairing Sierpinski dome textures with geometry guidance for video retaking, plus a negative-RoPE trick that appends source tokens to the target sequence without model changes or per-video tuning. This targets the sparsity problem when target cameras move far from the source path.

The approach is straightforward engineering on top of existing diffusion pipelines. The fractal pattern choice is specific and aims to keep trackable features available at large viewpoint shifts where standard geometry fails. The conditioning method avoids architectural tweaks, which keeps it easy to apply.

The abstract claims significant improvements in controllability, consistency, and quality across challenging cases. If the full paper shows clean ablations and quantitative tables that isolate the pattern's role, that would be useful incremental work for people doing camera-controlled video generation.

The soft spot is that no results, ablations, or error breakdowns appear in the provided text. The central assumption—that the Sierpinski cues reliably supply rich features under large changes—remains untested in what we have, and the negative-RoPE benefit is also presented without separate evidence. The stress-test note on isolation is fair based on the abstract alone.

This is for researchers working on video diffusion and novel-view synthesis in computer vision. Someone already running similar conditioning experiments might test the negative-RoPE idea or try the pattern.

It deserves a serious referee because the problem is practical and the fixes are specific enough to evaluate once the numbers are in. Send it to review.

Referee Report

2 major / 1 minor

Summary. The paper introduces SierpinskiCam for video retaking from a single monocular video along user-specified camera trajectories. It augments standard geometry-based guidance (which degrades for out-of-trajectory viewpoints) with Sierpinski dome texture cues asserted to supply rich trackable features even under large viewpoint changes, and adds a reference conditioning scheme that appends source-video tokens with negative RoPE indices to ground appearance without architectural changes or per-video fine-tuning. The authors state that extensive experiments demonstrate significant gains in camera controllability, geometric consistency, and video quality.

Significance. If the reported gains are reproducible and the Sierpinski pattern's contribution can be isolated, the method would offer a lightweight, additive improvement to existing geometry-guided video diffusion pipelines for handling novel viewpoints. The negative-RoPE conditioning trick is a practical engineering contribution that avoids model surgery. These elements could be useful in VFX and content creation, but the significance hinges on whether the empirical claims are supported by properly controlled experiments.

major comments (2)

[Abstract] Abstract: the claim of 'significant gains' in controllability, consistency, and quality is asserted without any quantitative metrics, ablation tables, or error analysis, so it is impossible to determine whether the experiments actually support the central claims or contain post-hoc choices.
[Method] Method / Experiments: the central claim requires that the Sierpinski dome texture cues reliably supply rich trackable features when target cameras depart far from the source trajectory (where geometry guidance becomes sparse). No ablation or isolated analysis is described that separates the fractal pattern's contribution from the rest of the pipeline or from the negative-RoPE conditioning.

minor comments (1)

[Abstract] Abstract: grammatical error ('cues that contains' should be 'contain').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify opportunities to strengthen the presentation of our experimental results. We address each point below and indicate the revisions we will make.

Referee: [Abstract] Abstract: the claim of 'significant gains' in controllability, consistency, and quality is asserted without any quantitative metrics, ablation tables, or error analysis, so it is impossible to determine whether the experiments actually support the central claims or contain post-hoc choices.

Authors: The abstract is a high-level summary, but the referee is correct that it would benefit from explicit quantitative support. The main manuscript contains tables with metrics (camera pose error, geometric consistency scores, and perceptual quality) and ablation results. We will revise the abstract to include the key numerical improvements reported in Section 4. revision: yes

Referee: [Method] Method / Experiments: the central claim requires that the Sierpinski dome texture cues reliably supply rich trackable features when target cameras depart far from the source trajectory (where geometry guidance becomes sparse). No ablation or isolated analysis is described that separates the fractal pattern's contribution from the rest of the pipeline or from the negative-RoPE conditioning.

Authors: We agree that an explicit isolation of the Sierpinski dome contribution is necessary to substantiate the central claim. The current manuscript contains comparative experiments, but does not include a dedicated ablation that holds the negative-RoPE component fixed while varying only the texture cues. We will add this analysis, including feature-tracking visualizations for large viewpoint deviations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an empirical augmentation without self-referential derivations

The paper proposes SierpinskiCam as an additive technique (Sierpinski dome texture cues plus negative-RoPE reference conditioning) layered on existing geometry-guided video diffusion pipelines. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on experimental outcomes across retaking scenarios rather than on any load-bearing uniqueness theorem or ansatz smuggled via prior self-work. The fractal cue choice is presented as a design decision justified by known self-similarity properties, not as a result derived from the target metrics. This is a standard self-contained engineering contribution with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5738 in / 1200 out tokens · 53691 ms · 2026-06-27T03:17:56.996740+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 2 canonical work pages

[1] Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6034–6044, 2025.

[2] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025.

[3] Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, and Yaoyao Liu. Freeorbit4d: Training-free arbitrary camera redirection for monocular videos via geometry-complete 4d reconstruction. arXiv preprint arXiv:2601.18993, 2026.

[4] Kaihua Chen, Tarasha Khurana, and Deva Ramanan. Reconstruct, inpaint, finetune: Dynamic novel-view synthesis from monocular videos. arXiv preprint arXiv:2507.12646, 2025.

[5] Epic Games. Unreal engine. URL https://www.unrealengine.com

[6] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, June 1981. doi: 10.1145/358669.358692

[7] Google DeepMind. Veo 3. https://deepmind.google/models/veo/, 2025. Model card, accessed 25/Jul/2025.

[8] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

[9] Hyeonho Jeong, Suhyeon Lee, and Jong Chul Ye. Reangle-a-video: 4d video generation as video-to-video translation. arXiv preprint arXiv:2503.09151, 2025.

[10] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[11] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025.

[12] Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.

[13] Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert, Yuancheng Xu, Koichi Namekata, Yiwei Zhao, Bolei Zhou, Micah Goldblum, et al. Vista4d: Video reshooting with 4d point clouds. arXiv preprint arXiv:2604.21915, 2026.

[14] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. doi: 10.1023/B:VISI.0000029664.99615.94

[15] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[16] Byeongjun Park, Hyojun Go, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, and Changick Kim. Steerx: Creating any camera-free 3d and 4d scenes with geometric steering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27326–27337, 2025.

[17] Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, and Jong Chul Ye. Redirector: Creating any-length video retakes with rotary camera encoding. arXiv preprint arXiv:2511.19827, 2025.

[18] Jangho Park, Taesung Kwon, and Jong Chul Ye. Zero4d: Training-free 4d video generation from single video using off-the-shelf video diffusion. arXiv preprint arXiv:2503.22622, 2025.

[19] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

[20] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[21] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[22] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024.

[23] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[24] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

[25] Kevin Wooley and Ronald Mallet. Scale independent tracking pattern. U.S. Patent US9672417B2, June 2017. URL https://patents.google.com/patent/US9672417B2/en. Assigned to Lucasfilm Entertainment Co. Ltd.

[26] Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: 3d point tracking made easy. arXiv preprint arXiv:2507.12462, 2025.

[27] Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, and Xingang Pan. Trajectory attention for fine-grained video motion control. arXiv preprint arXiv:2411.19324, 2024.

[28] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[29] Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638, 2025.

[30] David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2050–2062, 2025.

[31] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11):1330–1334, 2000.

[32] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In SIGGRAPH, 2018.
Table 4: Instructions used for the user study.

Overall preference: Please rate each result based on your overall preference, considering visual quality, realism, temporal coherence, and similarity to the source video.

Camera motion accuracy: Please rate each result based on how well its camera motion follows the target trajectory described in the question.

Stability & source consistency: Please rate each result based on temporal stability and consistency with the source video. A higher score means less flickering, fewer unexpected changes in the subject/background, and better preservation of the source identity and geometry.

[1] [1]

Met3r: Measuring multi-view consistency in generated images

Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6034–6044, 2025

2025

[2] [2]

Recammaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025

arXiv 2025

[3] [3]

Freeorbit4d: Training-free arbitrary camera redirection for monocular videos via geometry-complete 4d reconstruction.arXiv preprint arXiv:2601.18993, 2026

Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, and Yaoyao Liu. Freeorbit4d: Training-free arbitrary camera redirection for monocular videos via geometry-complete 4d reconstruction.arXiv preprint arXiv:2601.18993, 2026

Pith/arXiv arXiv 2026

[4] [4]

Reconstruct, inpaint, finetune: Dynamic novel-view synthesis from monocular videos.arXiv preprint arXiv:2507.12646, 2025

Kaihua Chen, Tarasha Khurana, and Deva Ramanan. Reconstruct, inpaint, finetune: Dynamic novel-view synthesis from monocular videos.arXiv preprint arXiv:2507.12646, 2025

arXiv 2025

[5] [5]

Unreal engine

Epic Games. Unreal engine. URLhttps://www.unrealengine.com

[6] [6]

Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartog- raphy

Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, June 1981. doi: 10.1145/358669.358692

work page doi:10.1145/358669.358692 1981

[7] [7]

Google DeepMind. Veo 3. https://deepmind.google/models/veo/, 2025. Model card, accessed 25/Jul/2025

2025

[8] [8]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024

[9] [9]

Reangle-a-video: 4d video generation as video-to-video translation.arXiv preprint arXiv:2503.09151, 2025

Hyeonho Jeong, Suhyeon Lee, and Jong Chul Ye. Reangle-a-video: 4d video generation as video-to-video translation.arXiv preprint arXiv:2503.09151, 2025

arXiv 2025

[10] [10]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024

[11] [11]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

Pith/arXiv arXiv 2025

[12] Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

[13] Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert, Yuancheng Xu, Koichi Namekata, Yiwei Zhao, Bolei Zhou, Micah Goldblum, et al. Vista4d: Video reshooting with 4d point clouds. arXiv preprint arXiv:2604.21915, 2026

[14] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. doi: 10.1023/B:VISI.0000029664.99615.94

[15] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

[16] Byeongjun Park, Hyojun Go, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, and Changick Kim. Steerx: Creating any camera-free 3d and 4d scenes with geometric steering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27326–27337, 2025

[17] Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, and Jong Chul Ye. Redirector: Creating any-length video retakes with rotary camera encoding. arXiv preprint arXiv:2511.19827, 2025

[18] Jangho Park, Taesung Kwon, and Jong Chul Ye. Zero4d: Training-free 4d video generation from single video using off-the-shelf video diffusion. arXiv preprint arXiv:2503.22622, 2025

[19] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017

[20] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

[21] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

[22] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024

[23] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[24] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

[25] Kevin Wooley and Ronald Mallet. Scale independent tracking pattern. U.S. Patent US9672417B2, June 2017. URL https://patents.google.com/patent/US9672417B2/en. Assigned to Lucasfilm Entertainment Co. Ltd

[26] Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: 3d point tracking made easy. arXiv preprint arXiv:2507.12462, 2025

[27] Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, and Xingang Pan. Trajectory attention for fine-grained video motion control. arXiv preprint arXiv:2411.19324, 2024

[28] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

[29] Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638, 2025

[30] David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2050–2062, 2025

[31] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11):1330–1334, 2000

[32] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In SIGGRAPH, 2018

A Motivation

Figure 6: Failure cases of TrajectoryCrafter. When the target camera moves beyond the original scene coverage, it may hallucinate content or fail to follow the camera pose, especia...

Overall preference: Please rate each result based on your overall preference, considering visual quality, realism, temporal coherence, and similarity to the source video.

Camera motion accuracy: Please rate each result based on how well its camera motion follows the target trajectory described in the question.

Stability & source consistency: Please rate each result based on temporal stability and consistency with the source video. A higher score means less flickering, fewer unexpected changes in the subject/background, and better preservation of the source identity and geometry.

Table 4: Instructions used for the user study.

Figure 9: Screenshot of the user stud...