pith. sign in

arxiv: 2605.20961 · v1 · pith:5SVZAIYRnew · submitted 2026-05-20 · 💻 cs.CV

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

Pith reviewed 2026-05-21 05:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D video editingregion-aware conditioningvideo diffusion modelsdisocclusion handlingproxy task trainingfaithful editingPREX frameworkPREBench benchmark
0
0 comments X

The pith

Decomposing 4D video editing into Preserve, Reveal, and Expand roles fixes evidence mismatches in diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that current 4D video diffusion models entangle reliable observed evidence with unreliable cues, causing drift, ghosting, and poor extrapolation during editing. It establishes that explicitly decomposing the spatiotemporal volume into Preserve for source-backed regions, Reveal for newly visible areas, and Expand for out-of-view parts allows the creation of calibrated conditioning signals. These signals are fed into a frozen diffusion backbone using a region-aware adapter that is trained solely through proxy tasks. This leads to better handling of different region types while keeping overall video quality and 4D consistency. The introduction of PREBench provides a way to diagnose these specific issues with targeted metrics.

Core claim

PREX decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent, builds observation-backed appearance cues with calibrated confidence, and injects them into a frozen video diffusion backbone through a region-aware adapter trained with proxy tasks without requiring paired edited videos.

What carries the argument

The region-aware adapter that conditions the diffusion model on role-decomposed cues derived from observation support.

If this is right

  • Reduces region-structured failures such as preservation drift and ghosting.
  • Maintains strong visual quality in edited 4D videos.
  • Preserves 4D edit control capability.
  • Allows training without paired edited video data using proxy tasks.
  • Provides diagnostic evaluation through the new PREBench benchmark with region-role masks and human-aligned metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar role decomposition could be tested in other conditional video synthesis tasks involving partial observations.
  • Applying this to longer sequences might reveal benefits for temporal consistency in edits.
  • The approach suggests a general strategy for handling uncertainty in generative models for 3D-consistent content.
  • Future work could explore automating the role assignment without manual masks.

Load-bearing premise

Proxy tasks without paired edited videos are sufficient to train the region-aware adapter to correctly assign and condition on Preserve, Reveal, and Expand roles according to observation support.

What would settle it

A direct comparison on videos with known disoccluded regions showing whether PREX produces fewer instances of ghosting or content drift than standard single-signal conditioning.

Figures

Figures reproduced from arXiv: 2605.20961 by Chunfeng Wang, Hao Li, Jiahui Yuan, Kun Zhan, Wenzhang Sun, Xiangchen Yin, Xiaoyan Sun, Zhangchi Hu.

Figure 1
Figure 1. Figure 1: Under 4D-guided video editing, coarse conditioning can cause Evidence-Role Mismatch problem. PREX separates Preserve, Reveal, and Expand regions, builds observation-backed cues and confidence maps, and injects them through a region-aware adapter with proxy-task training. Our proposed PREBench evaluates the resulting edits with region-aware metrics. Abstract Existing 4D-driven video diffusion models primari… view at source ↗
Figure 2
Figure 2. Figure 2: An overview of PREX pipeline. Unified conditioning can mix evidence roles in a user￾edited 4D proxy, leading to artifacts in revealed and expanded regions. PREX separates Preserve, Reveal, and Expand regions, conditions a frozen video diffusion model with observation-backed cues and confidence maps through a Region-aware Adapter. to preserve observed content when it remains valid and synthesize only the re… view at source ↗
Figure 3
Figure 3. Figure 3: Construction pipeline of PREBench to obtain high-quality samples for training and testing. 4.1 Proxy-task Curriculum for Supervised Training Directly supervising a 4D video editor requires paired source videos, edited 4D scenes, and edited video ground truth, which are difficult to collect at scale. PREX adopts a proxy-task curriculum built from unedited videos, as shown in [PITH_FULL_IMAGE:figures/full_f… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of camera-only motion control on PREBench dataset. 5.3 Qualitative Comparisons [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison of camera-object joint motion control on PREBench dataset. 𝐶!"# PREX w/o observation-backed appearance cues w/ o adapter design GT [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative ablation of observation￾backed appearance cues and adapter design. 𝐶!"# PREX w/o role regions Original Video Confidence w/o curriculum w/o confidence Object Motion [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Failure modes captured by PREBench diagnostic metrics. Visualization. We visualize typical failures diagnosed by PREBench metrics in [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Interactive 4D editing interface. The interface provides a scene-level sandbox for editing reconstructed 4D scenes. Users can load a scene, inspect the reconstructed point-based 4D representation, manipulate dynamic instances with interactive transform controls, preview the edited result from target camera views, and manage camera/object keyframes through a timeline. For each target video, PREX provides th… view at source ↗
Figure 10
Figure 10. Figure 10: More qualitative comparison of camera-only motion control on PREBench dataset. E Details of PREBench Dataset E.1 Train Set The PREBench training set contains 10,000 video samples collected from six public video datasets, as summarized in [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: More qualitative comparison of camera-object joint motion control on PREBench dataset. Edited Scene & Target Camera Candidate Visibility 𝐼!"# … 𝐼! 𝐼!$# 𝑥! … Depth Consistency Instance Consistency 2 Top-1 Selection View-time Compatibility Score 𝑠 𝑟 = ⟨𝑑! !"!, 𝑑!,$ %$&⟩ − 𝜆 𝑟 − 𝑡 𝑊 1 Validity Check -0.18 -0.05 0.46 0.62 0.79 3 Construct Outputs Appearance Cues & PREX Mask Red: Reveal Blue: Expand Confidence… view at source ↗
Figure 12
Figure 12. Figure 12: Design of Observation-backed Appearance Rendering for geometric conditions. F Observation-backed Appearance Conditioning PREX constructs the RGB control C rgb t as an observation-backed appearance field rather than directly using the appearance rendered from the edited 4D proxy. This design is motivated by the fact that rendered 4D appearance may be reliable in source-supported regions, but can become inv… view at source ↗
Figure 13
Figure 13. Figure 13: Qualitative demonstration on failure cases of PREX. [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗
read the original abstract

Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: https://ricepastem.github.io/PREX-Open

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes PREX, a region-aware framework for faithful 4D video editing in video diffusion models. It identifies Evidence-Role Mismatch where source-backed evidence, unreliable rendered cues, and unsupported regions are entangled, leading to preservation drift, ghosting, and unstable extrapolation. PREX decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support, builds calibrated appearance cues, and injects them via a region-aware adapter trained exclusively with proxy tasks (no paired edited videos) into a frozen backbone. It also introduces PREBench with curated edits, region-role masks, and human-aligned metrics. Experiments claim that PREX reduces region-structured failures while preserving visual quality and 4D edit control.

Significance. If the central claims hold, the work would represent a meaningful advance in 4D video editing by shifting focus from plausible generation to faithful editing that respects observation support. The proxy-task training strategy without paired data and the diagnostic PREBench benchmark are notable contributions that could influence subsequent research on region-structured conditioning in diffusion models. The explicit decomposition into Preserve/Reveal/Expand roles provides a clear conceptual handle on a common failure mode.

major comments (3)
  1. [Section 3.2] Section 3.2: The region-aware adapter is trained exclusively via proxy tasks without paired edited videos. The manuscript must demonstrate that these proxies reproduce the precise evidence-role mismatches that arise under real 4D camera motion and scene extent changes; otherwise the calibrated cues injected into the frozen backbone may still mix unreliable signals into Preserve regions, directly undermining the reported reduction in ghosting and drift.
  2. [Section 4.1] Section 4.1 and experimental results: The claim that PREX reduces region-structured failures rests on the adapter correctly assigning roles from observation support alone. Without ablations that isolate the proxy-trained adapter from the role decomposition itself, or quantitative tables showing per-region metrics (e.g., preservation accuracy on disoccluded areas), it is not possible to confirm that the observed improvements are load-bearing on the proposed mechanism.
  3. [PREBench] PREBench description: The benchmark supplies region-role masks, yet the paper does not detail how these masks are generated or validated against ground-truth observation support in 4D volumes. This validation is essential because the central evaluation of role-aware conditioning depends on the masks accurately reflecting Preserve/Reveal/Expand boundaries.
minor comments (2)
  1. [Abstract] The abstract introduces 'Evidence-Role Mismatch' without a concise formal definition; a one-sentence definition early in the introduction would improve readability.
  2. [Methods] Notation for the three roles (Preserve, Reveal, Expand) should be introduced with consistent symbols or abbreviations when first used in the methods to avoid ambiguity in later equations or figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below, providing clarifications and revisions where they strengthen the presentation of our contributions without altering the core claims.

read point-by-point responses
  1. Referee: [Section 3.2] Section 3.2: The region-aware adapter is trained exclusively via proxy tasks without paired edited videos. The manuscript must demonstrate that these proxies reproduce the precise evidence-role mismatches that arise under real 4D camera motion and scene extent changes; otherwise the calibrated cues injected into the frozen backbone may still mix unreliable signals into Preserve regions, directly undermining the reported reduction in ghosting and drift.

    Authors: We appreciate the referee's emphasis on validating the proxy tasks. Section 3.2 describes the proxy tasks as using synthetic masking, depth-based visibility perturbations, and simulated viewpoint shifts derived from the input 4D trajectories to emulate evidence-role mismatches. These are chosen because they directly target the entanglement of reliable source evidence with unreliable rendered cues and unsupported regions. To strengthen this, we have added a new analysis subsection with side-by-side qualitative comparisons and a small quantitative table measuring mismatch statistics (e.g., ghosting frequency under proxy vs. real camera motion) on held-out sequences. This shows that the proxies closely replicate the failure modes observed in real 4D editing, supporting the training strategy. revision: yes

  2. Referee: [Section 4.1] Section 4.1 and experimental results: The claim that PREX reduces region-structured failures rests on the adapter correctly assigning roles from observation support alone. Without ablations that isolate the proxy-trained adapter from the role decomposition itself, or quantitative tables showing per-region metrics (e.g., preservation accuracy on disoccluded areas), it is not possible to confirm that the observed improvements are load-bearing on the proposed mechanism.

    Authors: We agree that targeted ablations and per-region metrics would make the contribution of the mechanism clearer. The original experiments in Section 4.1 compare full PREX against baselines and variants, but we acknowledge the value of further isolation. In the revision we have added an ablation study that trains the adapter with and without the role decomposition (i.e., uniform vs. region-aware conditioning) and a new table reporting per-region metrics including preservation accuracy on disoccluded (Reveal) areas and extrapolation stability on Expand regions. These results indicate that the gains are attributable to the combination of role decomposition and proxy-trained adapter. revision: yes

  3. Referee: [PREBench] PREBench description: The benchmark supplies region-role masks, yet the paper does not detail how these masks are generated or validated against ground-truth observation support in 4D volumes. This validation is essential because the central evaluation of role-aware conditioning depends on the masks accurately reflecting Preserve/Reveal/Expand boundaries.

    Authors: We thank the referee for this observation. The masks are generated by computing per-voxel observation support from the input 4D reconstruction and camera poses: voxels visible in source frames are labeled Preserve, newly visible but consistent regions are Reveal, and out-of-frustum or inconsistent areas are Expand, using depth reprojection and visibility checks. In the revised manuscript we have expanded the PREBench section with the full algorithmic procedure for mask generation and added a validation paragraph describing agreement with manual annotations on a 20-sequence subset (reported as 92% label consistency). This provides the requested grounding against ground-truth observation support. revision: yes

Circularity Check

0 steps flagged

Minor self-citation not load-bearing; central framework derivation is independent

full rationale

The paper introduces a new region-aware adapter trained via proxy tasks on a frozen diffusion backbone without paired edited videos. No equations or definitions in the provided abstract or method description reduce the central claims about Evidence-Role Mismatch decomposition or role assignment to fitted parameters or self-referential constructions by construction. The adapter training and injection into the backbone are described as separate steps, with evaluation on a new diagnostic benchmark. Any self-citations appear peripheral and do not justify the core conditioning mechanism, keeping the derivation self-contained against external benchmarks and falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Approach rests on standard diffusion model assumptions and introduces the Evidence-Role Mismatch concept plus role decomposition without explicit free parameters or new physical entities.

axioms (1)
  • domain assumption A frozen video diffusion backbone can be effectively conditioned for editing via an added region-aware adapter.
    Central to the PREX design described in the abstract.
invented entities (1)
  • Evidence-Role Mismatch no independent evidence
    purpose: To characterize the entanglement of reliable source evidence, unreliable rendered cues, and unsupported regions in a single conditioning signal.
    Identified as the core problem motivating the Preserve-Reveal-Expand decomposition.

pith-pipeline@v0.9.0 · 5749 in / 1335 out tokens · 47906 ms · 2026-05-21T05:08:26.533367+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

  1. [1]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

  2. [2]

    Deepverse: 4d autoregressive video generation as a world model.arXiv preprint arXiv:2506.01103, 2025

    Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model.arXiv preprint arXiv:2506.01103, 2025

  3. [3]

    Perception-as-control: Fine-grained controllable image animation with 3d-aware motion representation

    Yingjie Chen, Yifang Men, Yuan Yao, Miaomiao Cui, and Liefeng Bo. Perception-as-control: Fine-grained controllable image animation with 3d-aware motion representation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14380–14389, 2025

  4. [4]

    Wan-move: Motion-controllable video generation via latent trajectory guidance.arXiv preprint arXiv:2512.08765, 2025

    Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, et al. Wan-move: Motion-controllable video generation via latent trajectory guidance.arXiv preprint arXiv:2512.08765, 2025

  5. [5]

    TokenFlow: Consistent Diffusion Features for Consistent Video Editing

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing.arXiv preprint arXiv:2307.10373, 2023

  6. [6]

    Diffusion as shader: 3d-aware video diffusion for versatile video generation control

    Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, et al. Diffusion as shader: 3d-aware video diffusion for versatile video generation control. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–12, 2025

  7. [7]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

  8. [8]

    V oyager: Long-range and world-consistent video diffusion for explorable 3d scene generation.ACM Transactions on Graphics (TOG), 44 (6):1–15, 2025

    Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson Lau, Wangmeng Zuo, et al. V oyager: Long-range and world-consistent video diffusion for explorable 3d scene generation.ACM Transactions on Graphics (TOG), 44 (6):1–15, 2025

  9. [9]

    Cotracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. InEuropean conference on computer vision, pages 18–35. Springer, 2024

  10. [10]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  11. [11]

    arXiv preprint arXiv:2512.02015 (2025)

    Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, and Zhengqi Li. Generative video motion editing with 3d point tracks. arXiv preprint arXiv:2512.02015, 2025

  12. [12]

    Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance

    Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12112–12123, 2025

  13. [13]

    Flashworld: High-quality3dscenegenerationwithinseconds.arXiv preprint arXiv:2510.13678, 2025

    Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, and Liujuan Cao. Flashworld: High-quality 3d scene generation within seconds.arXiv preprint arXiv:2510.13678, 2025

  14. [14]

    Vidtome: Video token merging for zero-shot video editing

    Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. Vidtome: Video token merging for zero-shot video editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7486–7495, 2024

  15. [15]

    Vista4D: Video Reshooting with 4D Point Clouds

    Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert, Yuancheng Xu, Koichi Namekata, Yiwei Zhao, Bolei Zhou, Micah Goldblum, et al. Vista4d: Video reshooting with 4d point clouds.arXiv preprint arXiv:2604.21915, 2026. 10

  16. [16]

    Trace anything: Representing any video in 4d via trajectory fields.arXiv preprint arXiv:2510.13802, 2025

    Xinhang Liu, Yuxi Xiao, Donny Y Chen, Jiashi Feng, Yu-Wing Tai, Chi-Keung Tang, and Bingyi Kang. Trace anything: Representing any video in 4d via trajectory fields.arXiv preprint arXiv:2510.13802, 2025

  17. [17]

    Trackingworld: World-centric monocular 3d tracking of almost all pixels.arXiv preprint arXiv:2512.08358, 2025

    Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang, Zhiyang Dou, Cheng Lin, Sai-Kit Yeung, and Yuan Liu. Trackingworld: World-centric monocular 3d tracking of almost all pixels.arXiv preprint arXiv:2512.08358, 2025

  18. [18]

    Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

  19. [19]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  20. [20]

    Fatezero: Fusing attentions for zero-shot text-based video editing

    Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023

  21. [21]

    Gen3c: 3d-informed world- consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6121–6132, 2025

  22. [22]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  23. [23]

    arXiv preprint arXiv:2505.22944 (2025)

    Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, and Chongyang Ma. Ati: Any trajectory instruction for controllable video generation.arXiv preprint arXiv:2505.22944, 2025

  24. [24]

    Boximator: Generating rich and controllable motions for video synthesis.arXiv preprint arXiv:2402.01566, 2024

    Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, and Hang Li. Boximator: Generating rich and controllable motions for video synthesis.arXiv preprint arXiv:2402.01566, 2024

  25. [25]

    Videodirector: Precise video editing via text-to-video models

    Yukun Wang, Longguang Wang, Zhiyuan Ma, Qibin Hu, Kai Xu, and Yulan Guo. Videodirector: Precise video editing via text-to-video models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2589–2598, 2025

  26. [26]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  27. [27]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20310–20320, 2024

  28. [28]

    Reconfusion: 3d reconstruction with diffusion priors

    Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21551–21561, 2024

  29. [29]

    Recent advances in 3d gaussian splatting.Computational Visual Media, 10(4):613–642, 2024

    Tong Wu, Yu-Jie Yuan, Ling-Xiao Zhang, Jie Yang, Yan-Pei Cao, Ling-Qi Yan, and Lin Gao. Recent advances in 3d gaussian splatting.Computational Visual Media, 10(4):613–642, 2024

  30. [30]

    Draganything: Motion control for anything using entity representation

    Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. InEuropean Conference on Computer Vision, pages 331–348. Springer, 2024. 11

  31. [31]

    Spatialtracker: Tracking any 2d pixels in 3d space

    Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker: Tracking any 2d pixels in 3d space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20406–20417, 2024

  32. [32]

    Motioncanvas: Cinematic shot design with controllable image- to-video generation

    Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, and Feng Liu. Motioncanvas: Cinematic shot design with controllable image- to-video generation. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

  33. [33]

    Rerender a video: Zero-shot text-guided video-to-video translation

    Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. InSIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023

  34. [34]

    Neoverse: Enhancing 4d world model with in-the-wild monocular videos.arXiv preprint arXiv:2601.00393, 2026

    Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, and Zhaoxiang Zhang. Neoverse: Enhancing 4d world model with in-the-wild monocular videos.arXiv preprint arXiv:2601.00393, 2026

  35. [35]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  36. [36]

    Uni4d: Unifying visual foundation models for 4d modeling from a single video

    David Yifan Yao, Albert J Zhai, and Shenlong Wang. Uni4d: Unifying visual foundation models for 4d modeling from a single video. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1116–1126, 2025

  37. [37]

    DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

    Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023

  38. [38]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

  39. [39]

    Flextraj: Image-to-video generation with flexible point trajectory control.arXiv preprint arXiv:2510.08527, 2025

    Zhiyuan Zhang, Can Wang, Dongdong Chen, and Jing Liao. Flextraj: Image-to-video generation with flexible point trajectory control.arXiv preprint arXiv:2510.08527, 2025

  40. [40]

    Motionpro: A precise motion controller for image-to-video generation

    Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, and Tao Mei. Motionpro: A precise motion controller for image-to-video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27957–27967, 2025

  41. [41]

    billboard

    Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, and Yanwei Fu. Ver- secrafter: Dynamic realistic video world model with 4d geometric control.arXiv preprint arXiv:2601.05138, 2026

  42. [42]

    Overall Score

    Shangchen Zhou, Chongyi Li, Kelvin CK Chan, and Chen Change Loy. Propainter: Improving propagation and transformer for video inpainting. InProceedings of the IEEE/CVF international conference on computer vision, pages 10477–10486, 2023. 12 A Evaluation Metrics A.1 PREBench Metrics PREBench evaluates faithful 4D video editing with region-aware metrics over...