arxiv: 2606.17310 · v1 · pith:VGZXGSTWnew · submitted 2026-06-15 · 💻 cs.CV

SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

Pith reviewed 2026-06-27 03:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords video retaking · camera controllability · Sierpinski triangle · texture cues · video diffusion · viewpoint changes · RoPE indices · geometric consistency

The pith

SierpinskiCam augments geometry guidance with Sierpinski dome texture cues to maintain camera control during large viewpoint shifts in video retaking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video retaking generates novel views along a user-specified camera trajectory from one monocular video. Geometry-only guidance from reconstructed 4D representations weakens or vanishes when the target camera departs far from the source path, leaving new scene regions unsupported. SierpinskiCam adds texture cues rendered from a Sierpinski dome pattern that supply dense trackable features even across extreme viewpoint changes. It further conditions the diffusion model on the source video by appending its tokens to the target sequence and isolating the streams with negative RoPE indices, achieving appearance grounding with no model changes or per-video fine-tuning. Experiments report gains in camera controllability, geometric consistency, and final video quality across varied retaking cases.

Core claim

SierpinskiCam augments geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. It introduces a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios.
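To make the negative-index idea concrete, here is a minimal 1D sketch in plain Python. It is an illustration under assumptions, not the paper's implementation: the helper names (`rope_rotate`, `stream_positions`), the choice of indices -S..-1 for source tokens and 0..T-1 for target tokens, and the single-axis rotation are all hypothetical, since the abstract only states that negative RoPE indices separate the two streams. The sketch shows that the two streams occupy disjoint index ranges and that a rotary attention score depends only on the difference between the two indices.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply a 1D rotary position embedding to an even-length vector.

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by an angle
    pos * base ** (-i / d), the usual RoPE frequency schedule.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


def stream_positions(num_source, num_target):
    """Hypothetical index layout: source tokens get negative positions,
    target tokens get non-negative positions, so the ranges never overlap."""
    source = list(range(-num_source, 0))
    target = list(range(num_target))
    return source, target


source_pos, target_pos = stream_positions(3, 4)

# Toy query/key vectors (made-up values).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# A target query at index 1 attending to a source key at index -2 ...
score_cross = dot(rope_rotate(q, 1), rope_rotate(k, -2))
# ... gives the same score as any other pair with the same index offset.
score_shifted = dot(rope_rotate(q, 4), rope_rotate(k, 1))
```

Because only the offset matters, placing the source stream at negative indices keeps every source-to-target offset distinct from the offsets used within the target stream in this toy layout, which is one plausible reading of how the streams are kept apart.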

What carries the argument

Sierpinski dome texture cues that supply persistent trackable features, combined with negative-RoPE separation of source and target video token streams for reference conditioning.
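The multi-scale property that motivates a fractal cue can be illustrated with a toy binary pattern. The sketch below uses the classic bitwise construction (cell (x, y) is filled when `x & y == 0`), which yields a Sierpinski triangle pattern. This construction and the names `sierpinski_mask` and `subsample` are assumptions for illustration; the paper's actual dome texture may be generated differently. Subsampling by two reproduces the same pattern at the lower resolution, a rough stand-in for viewing it from farther away.

```python
def sierpinski_mask(size):
    """Return a size x size grid of 0/1 values.

    Cell (x, y) is 1 when x & y == 0, the bitwise rule that draws a
    Sierpinski triangle on a power-of-two grid.
    """
    return [[1 if (x & y) == 0 else 0 for x in range(size)]
            for y in range(size)]


def subsample(grid, step):
    """Keep every step-th row and column (a crude 'farther view')."""
    return [row[::step] for row in grid[::step]]


full = sierpinski_mask(64)
coarse = subsample(full, 2)

# Self-similarity: the subsampled 64x64 pattern is the 32x32 pattern.
same_structure = coarse == sierpinski_mask(32)

# Number of filled cells at each resolution.
filled_full = sum(map(sum, full))
filled_coarse = sum(map(sum, coarse))
```

A regular checkerboard has no comparable property: once its cells shrink below the sampling step, the subsampled grid loses the alternating structure, which matches the near/far contrast described in the Figure 3 caption.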

If this is right

  • Target camera paths can deviate substantially farther from the source trajectory while retaining usable guidance.
  • Newly revealed scene regions stay geometrically consistent because the added cues remain trackable.
  • Appearance grounding succeeds without any per-video adaptation or architecture changes to the underlying diffusion model.
  • The same gains appear across diverse and challenging retaking scenarios without task-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fractal-pattern cue idea could be tested in other diffusion pipelines that need viewpoint-robust conditioning, such as novel-view video synthesis.
  • Negative-RoPE stream separation might reduce interference in any token-based reference conditioning setup inside diffusion transformers.
  • Replacing the fixed Sierpinski pattern with an optimized or scene-adaptive texture could further increase feature density in low-texture regions.

Load-bearing premise

Sierpinski dome texture cues will reliably contain rich trackable features even under large viewpoint changes.

What would settle it

Run the method on a source video whose target trajectory produces large viewpoint changes that render the projected Sierpinski pattern features untrackable; if retaking quality then matches or falls below the geometry-only baseline, the central claim is false.
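Whether pattern features count as "trackable" needs an operational criterion. The Figure 5 caption mentions SIFT matches filtered by the Lowe ratio and verified with RANSAC. Below is a minimal sketch of the ratio test alone, on made-up 2D descriptors. The function name, the 0.75 threshold, and the toy data are assumptions, and neither SIFT extraction nor the RANSAC step is shown.

```python
import math


def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b,
    keeping the match only if the nearest distance is clearly smaller than
    the second-nearest distance (Lowe's ratio test)."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio is undefined with fewer than two candidates
        (best, j), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


# Toy descriptors for two frames (hypothetical values).
frame_t = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
frame_t5 = [(0.1, 0.0), (10.0, 0.2), (5.0, 4.0), (5.0, 6.0)]

matches = ratio_test_matches(frame_t, frame_t5)
```

In this toy data the third descriptor has two equally close candidates, so the ratio test rejects it as ambiguous. A pattern whose projected features all become ambiguous in this sense under a large viewpoint change would be one concrete way the load-bearing premise could fail.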

Figures

Figures reproduced from arXiv: 2606.17310 by Hyelin Nam, Jeong Joon Park, Supasorn Suwajanakorn, Suttisak Wizadwongsa.

Figure 1: SierpinskiCam for video retake generation. Given a source video and a target camera trajectory (blue → red), SierpinskiCam retakes the video under user-defined camera motions. Even under large viewpoint changes with sparse source evidence, our Sierpinski textured dome (top condition video) and negative rotary position embedding allow faithful following of the target camera trajectory while preserving the o…
Figure 2: Overview of SierpinskiCam. Our model processes two parallel streams: (1) a target stream (top) where noisy target latents z_t are concatenated with the Sierpinski-dome camera-controlling video c, using positive RoPE indices for spatial alignment; and (2) a source stream (bottom) where noised and clean source latents are concatenated using negative RoPE indices. This negative indexing isolates the source con…
Figure 3: Multi-scale robustness. Unlike the Checkerboard pattern (a), the Sierpinski fractal pattern (b) provides structural details in both near and far views, ensuring camera pose control across scales. Sierpinski textured dome. This motivates an additional camera-motion cue that remains informative beyond the valid coverage of the source-derived geometry. We texture the dome with a Sierpinski fractal triangle …
Figure 4: Qualitative comparison on the DAVIS dataset. For each scene, top row shows the source video and target trajectory, where the camera moves from blue → red. Note how prior methods incorrectly put main characters (human/swan) at the center of the frames or create spurious artifacts, while SierpinskiCam accurately follows the user-defined camera paths. Left: the person should go down since the camera is fixed.…
Figure 5: (a) Pattern-only trackability over 10 camera trajectories. Ours (Sierpinski) yields the most RANSAC-verified SIFT inliers per frame pair. (b) Representative frame-t to frame-t+5 matches; for clarity, only the top 10% inliers by Lowe ratio are drawn, while titles report total inliers.
Figure 6: Failure cases of TrajectoryCrafter. When the target camera moves beyond the original scene coverage, it may hallucinate content or fail to follow the camera pose, especially as the main object becomes small or leaves the target-view frustum. We observed that TrajectoryCrafter [29] often fails when the target camera trajectory extends far beyond the original scene coverage. As shown in …
Figure 7: Geometry proxy design and conditioning video examples. (a) The Sierpinski triangle pattern used as a geometry proxy for camera motion conditioning. Its self-similar structure provides rich spatial cues without scene-specific texture, enabling generalizable control. (b) Example conditioning videos under two representative depth regimes: far-field (distant background) and near-field (close background), illu…
Figure 8: Additional camera trajectories used for evaluation. Each row shows the geometry proxy frames used as video conditioning input for four additional camera trajectories. The sampled frames illustrate the spatial extent and viewpoint variation induced by each trajectory. Although DA3 represents one of the latest advances in dynamic camera motion estimation, we observe that its estimated camera paths still cont…
Figure 9: Screenshot of the user study inter…
Figure 10: Additional comparison on newly added challenging camera trajectories.
Figure 11: Additional qualitative comparison on the DAVIS dataset.
Figure 12: Additional qualitative comparison on generated video by Veo.
Figure 13: Additional qualitative comparison on generated video by Veo.
Original abstract

Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contains rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SierpinskiCam for video retaking from a single monocular video along user-specified camera trajectories. It augments standard geometry-based guidance (which degrades for out-of-trajectory viewpoints) with Sierpinski dome texture cues asserted to supply rich trackable features even under large viewpoint changes, and adds a reference conditioning scheme that appends source-video tokens with negative RoPE indices to ground appearance without architectural changes or per-video fine-tuning. The authors state that extensive experiments demonstrate significant gains in camera controllability, geometric consistency, and video quality.

Significance. If the reported gains are reproducible and the Sierpinski pattern's contribution can be isolated, the method would offer a lightweight, additive improvement to existing geometry-guided video diffusion pipelines for handling novel viewpoints. The negative-RoPE conditioning trick is a practical engineering contribution that avoids model surgery. These elements could be useful in VFX and content creation, but the significance hinges on whether the empirical claims are supported by properly controlled experiments.

major comments (2)
  1. [Abstract] Abstract: the claim of 'significant gains' in controllability, consistency, and quality is asserted without any quantitative metrics, ablation tables, or error analysis, so it is impossible to determine whether the experiments actually support the central claims or contain post-hoc choices.
  2. [Method] Method / Experiments: the central claim requires that the Sierpinski dome texture cues reliably supply rich trackable features when target cameras depart far from the source trajectory (where geometry guidance becomes sparse). No ablation or isolated analysis is described that separates the fractal pattern's contribution from the rest of the pipeline or from the negative-RoPE conditioning.
minor comments (1)
  1. [Abstract] Abstract: grammatical error ('cues that contains' should be 'contain').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify opportunities to strengthen the presentation of our experimental results. We address each point below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'significant gains' in controllability, consistency, and quality is asserted without any quantitative metrics, ablation tables, or error analysis, so it is impossible to determine whether the experiments actually support the central claims or contain post-hoc choices.

    Authors: The abstract is a high-level summary, but the referee is correct that it would benefit from explicit quantitative support. The main manuscript contains tables with metrics (camera pose error, geometric consistency scores, and perceptual quality) and ablation results. We will revise the abstract to include the key numerical improvements reported in Section 4. revision: yes

  2. Referee: [Method] Method / Experiments: the central claim requires that the Sierpinski dome texture cues reliably supply rich trackable features when target cameras depart far from the source trajectory (where geometry guidance becomes sparse). No ablation or isolated analysis is described that separates the fractal pattern's contribution from the rest of the pipeline or from the negative-RoPE conditioning.

    Authors: We agree that an explicit isolation of the Sierpinski dome contribution is necessary to substantiate the central claim. The current manuscript contains comparative experiments, but does not include a dedicated ablation that holds the negative-RoPE component fixed while varying only the texture cues. We will add this analysis, including feature-tracking visualizations for large viewpoint deviations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an empirical augmentation without self-referential derivations

full rationale

The paper proposes SierpinskiCam as an additive technique (Sierpinski dome texture cues plus negative-RoPE reference conditioning) layered on existing geometry-guided video diffusion pipelines. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on experimental outcomes across retaking scenarios rather than on any load-bearing uniqueness theorem or ansatz smuggled via prior self-work. The fractal cue choice is presented as a design decision justified by known self-similarity properties, not as a result derived from the target metrics. This is a standard self-contained engineering contribution with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5738 in / 1200 out tokens · 53691 ms · 2026-06-27T03:17:56.996740+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 2 canonical work pages

  1. [1]

    Met3r: Measuring multi-view consistency in generated images

    Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6034–6044, 2025

  2. [2]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025

  3. [3]

    Freeorbit4d: Training-free arbitrary camera redirection for monocular videos via geometry-complete 4d reconstruction

    Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, and Yaoyao Liu. Freeorbit4d: Training-free arbitrary camera redirection for monocular videos via geometry-complete 4d reconstruction. arXiv preprint arXiv:2601.18993, 2026

  4. [4]

    Reconstruct, inpaint, finetune: Dynamic novel-view synthesis from monocular videos

    Kaihua Chen, Tarasha Khurana, and Deva Ramanan. Reconstruct, inpaint, finetune: Dynamic novel-view synthesis from monocular videos. arXiv preprint arXiv:2507.12646, 2025

  5. [5]

    Unreal engine

    Epic Games. Unreal engine. URL https://www.unrealengine.com

  6. [6]

    Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography

    Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, June 1981. doi: 10.1145/358669.358692

  7. [7]

    Google DeepMind. Veo 3. https://deepmind.google/models/veo/, 2025. Model card, accessed 25/Jul/2025

  8. [8]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  9. [9]

    Reangle-a-video: 4d video generation as video-to-video translation

    Hyeonho Jeong, Suhyeon Lee, and Jong Chul Ye. Reangle-a-video: 4d video generation as video-to-video translation. arXiv preprint arXiv:2503.09151, 2025

  10. [10]

    Hunyuanvideo: A systematic framework for large video generative models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  11. [11]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025

  12. [12]

    Depth anything 3: Recovering the visual space from any views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  13. [13]

    Vista4d: Video reshooting with 4d point clouds

    Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert, Yuancheng Xu, Koichi Namekata, Yiwei Zhao, Bolei Zhou, Micah Goldblum, et al. Vista4d: Video reshooting with 4d point clouds. arXiv preprint arXiv:2604.21915, 2026

  14. [14]

    David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. doi: 10.1023/B:VISI.0000029664.99615.94

  15. [15]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  16. [16]

    Steerx: Creating any camera-free 3d and 4d scenes with geometric steering

    Byeongjun Park, Hyojun Go, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, and Changick Kim. Steerx: Creating any camera-free 3d and 4d scenes with geometric steering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27326–27337, 2025

  17. [17]

    Redirector: Creating any-length video retakes with rotary camera encoding

    Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, and Jong Chul Ye. Redirector: Creating any-length video retakes with rotary camera encoding. arXiv preprint arXiv:2511.19827, 2025

  18. [18]

    Zero4d: Training-free 4d video generation from single video using off-the-shelf video diffusion

    Jangho Park, Taesung Kwon, and Jong Chul Ye. Zero4d: Training-free 4d video generation from single video using off-the-shelf video diffusion. arXiv preprint arXiv:2503.22622, 2025

  19. [19]

    The 2017 davis challenge on video object segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017

  20. [20]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  21. [21]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  22. [22]

    Generative camera dolly: Extreme monocular dynamic novel view synthesis

    Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024

  23. [23]

    Wan: Open and advanced large-scale video generative models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  24. [24]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  25. [25]

    Scale independent tracking pattern

    Kevin Wooley and Ronald Mallet. Scale independent tracking pattern. U.S. Patent US9672417B2, June 2017. URL https://patents.google.com/patent/US9672417B2/en. Assigned to Lucasfilm Entertainment Co. Ltd

  26. [26]

    Spatialtrackerv2: 3d point tracking made easy

    Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: 3d point tracking made easy. arXiv preprint arXiv:2507.12462, 2025

  27. [27]

    Trajectory attention for fine-grained video motion control

    Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, and Xingang Pan. Trajectory attention for fine-grained video motion control. arXiv preprint arXiv:2411.19324, 2024

  28. [28]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  29. [29]

    Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638, 2025

  30. [30]

    Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning

    David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2050–2062, 2025

  31. [31]

    A flexible new technique for camera calibration

    Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence, 22(11):1330–1334, 2000

  32. [32]

    Stereo magnification: Learning view synthesis using multiplane images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In SIGGRAPH, 2018

  33. [33]

    Overall preference: Please rate each result based on your overall preference, considering visual quality, realism, temporal coherence, and similarity to the source video

  34. [34]

    Camera motion accuracy: Please rate each result based on how well its camera motion follows the target trajectory described in the question

  35. [35]

    Stability & source consistency

    Stability & source consistency: Please rate each result based on temporal stability and consistency with the source video. A higher score means less flickering, fewer unexpected changes in the subject/background, and better preservation of the source identity and geometry. Table 4: Instructions used for the user study.