pith. machine review for the scientific record.

arxiv: 2604.17565 · v2 · submitted 2026-04-19 · 💻 cs.CV

Recognition: no theorem link

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

Hong Jiang, Ruijie Quan, Wensong Song, Yi Yang, Zongxing Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera-controllable image editing · geometric consistency · video models · novel view synthesis · geometric guidance · multi-view alignment · diffusion models · image editing

The pith

Unifying geometric guidance at representation, architecture, and loss levels lets video models edit images under new camera poses with less drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Camera-controllable image editing requires synthesizing new views of a scene while preserving strict geometric consistency across those views. Existing methods rely on fragmented guidance and image-based models that operate on discrete mappings, which produces drift and degradation especially during continuous camera motion. The paper argues that video models supply useful continuous viewpoint priors but still need unified geometric guidance injected at the three levels that shape generative output. By adding a frame-decoupled reference mechanism, anchor attention for feature alignment, and endpoint supervision for structural fidelity, the approach claims to stabilize results. If the claim holds, novel-view editing becomes more reliable for tasks that demand consistent scene structure under varying viewpoints.
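A minimal sketch of the geometry-construction step behind the frame-decoupled reference, assuming a depth map and camera intrinsics are available: the input image is lifted into a camera-space point cloud that could then be re-rendered as a per-frame geometric reference along the trajectory. The function name, tensor shapes, and depth source are illustrative assumptions, not the paper's implementation.

    import torch

    def lift_to_pointcloud(image, depth, K):
        # image: (3, H, W) RGB tensor; depth: (H, W) depth map; K: (3, 3) camera intrinsics.
        _, h, w = image.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pixels = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()  # homogeneous pixel coords, (H, W, 3)
        rays = pixels @ torch.linalg.inv(K).T        # back-project through the inverse intrinsics
        points = rays * depth.unsqueeze(-1)          # scale each ray by its depth -> camera-space XYZ
        colors = image.permute(1, 2, 0)              # per-point RGB, (H, W, 3)
        return points.reshape(-1, 3), colors.reshape(-1, 3)

The returned points and colors are what a renderer would project into each frame of the camera trajectory; how UniGeo actually encodes and injects them along the frame dimension is described only at the level of the Figure 2 caption.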

Core claim

The paper claims that fragmented geometric guidance is the root cause of instability in video-model-based camera-controllable editing and that injecting unified guidance at representation, architecture, and loss levels jointly resolves it. At the representation level a frame-decoupled geometric reference injection supplies cross-view context. At the architecture level geometric anchor attention aligns multi-view features. At the loss level a trajectory-endpoint supervision strategy explicitly reinforces structural fidelity of target views. Experiments across public benchmarks with both extensive and limited camera motion show the resulting outputs exceed prior methods in visual quality and geometric consistency.
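As a rough illustration of the architecture-level mechanism, the sketch below implements a single-head cross-attention in which every frame's tokens query only the first frame's tokens, which serve as geometric anchors (matching the description of geometric anchor attention in the Figure 2 caption). The projection matrices, shapes, and single-head layout are assumptions; the paper's module may differ.

    import torch

    def anchor_attention(frame_tokens, w_q, w_k, w_v):
        # frame_tokens: (batch, frames, tokens, dim); w_q, w_k, w_v: (dim, d_head) projections.
        d_head = w_q.shape[-1]
        anchors = frame_tokens[:, :1]                    # first-frame tokens act as the anchors
        q = frame_tokens @ w_q                           # queries come from every frame
        k = anchors @ w_k                                # keys come from the anchor frame only
        v = anchors @ w_v                                # values come from the anchor frame only
        scores = q @ k.transpose(-2, -1) / d_head ** 0.5
        attn = torch.softmax(scores, dim=-1)             # (batch, frames, tokens, tokens), broadcast over frames
        return attn @ v                                  # anchor-aligned features for every frame

Because the keys and values always come from the same anchor frame, each view's features are pulled toward one shared geometric reference rather than toward whichever neighboring frame happens to be closest.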

What carries the argument

The three-level unified geometric guidance system that combines frame-decoupled reference injection for context, geometric anchor attention for feature alignment, and trajectory-endpoint supervision for fidelity.
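A hedged sketch of the loss-level idea, assuming a simple per-frame reconstruction objective: the final (target-view) frame is weighted more heavily than intermediate frames, in line with the trajectory-endpoint supervision described in the Figure 2 caption. The mean-squared-error form and the weight value are illustrative choices, not numbers from the paper.

    import torch

    def endpoint_weighted_loss(pred, target, endpoint_weight=2.0):
        # pred, target: (batch, frames, channels, height, width) video tensors.
        num_frames = pred.shape[1]
        weights = torch.ones(num_frames, device=pred.device)
        weights[-1] = endpoint_weight                              # up-weight the final, target-view frame
        weights = weights / weights.sum()
        per_frame = ((pred - target) ** 2).mean(dim=(0, 2, 3, 4))  # mean-squared error per frame
        return (weights * per_frame).sum()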

If this is right

  • The unified approach outperforms existing methods on public benchmarks for both large and small camera motions.
  • Geometric drift and structural degradation are reduced under continuous camera movement.
  • Cross-view consistency is maintained more reliably because guidance acts at every level that shapes the output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same multi-level unification pattern could be tested on other tasks that require multi-view consistency, such as video prediction or light-field rendering.
  • Extending the supervision to longer sequences would check whether the stability scales to extended camera paths not covered in current benchmarks.
  • Pairing the framework with real-time pose estimation could enable interactive editing sessions where users freely move the virtual camera.

Load-bearing premise

That fragmented guidance is the main driver of drift and that adding unified injections at precisely these three levels will stabilize output without creating fresh inconsistencies or demanding heavy retuning.

What would settle it

A controlled test on long or rapid camera trajectories: if the three-level guidance still produces measurable geometric drift or structural degradation comparable to earlier methods, the core claim fails.

Figures

Figures reproduced from arXiv: 2604.17565 by Hong Jiang, Ruijie Quan, Wensong Song, Yi Yang, Zongxing Yang.

Figure 1
Figure 1. Visual comparisons. Existing methods relying on fragmented geometric guidance often suffer from structural distortions or artifacts under camera motion (highlighted in red). In contrast, by enforcing unified geometric guidance, our UniGeo successfully preserves global scene geometry and structural fidelity (highlighted in green, with selected details enlarged). view at source ↗
Figure 2
Figure 2. UniGeo Framework. UniGeo incorporates unified geometric guidance through: (a) Geometry Construction: Lifting input images into 3D point cloud sequences. (b) Frame-Decoupled Geometry Injection: Injecting sequences along the frame dimension. (c) Geometric Anchor Attention: Aligning cross-view features using first-frame tokens as anchors. (d) Trajectory-Endpoint Geometric Supervision: Applying higher loss wei… view at source ↗
Figure 3
Figure 3. Qualitative comparison under the extensive camera motion setting. Compared with other methods, our approach better preserves the geometric structure of the scene under extensive camera motion, effectively avoiding structural duplication. view at source ↗
Figure 4
Figure 4. Qualitative comparison under the limited camera motion setting. Our method maintains stable spatial layouts and scene structural consistency across views, while better preserving fine-grained scene details. (Panels: Input, Intermediate View, Result.) view at source ↗
Figure 5
Figure 5. Our approach models continuous camera motion characteristics. Sequences are shown from left to right: the input image (blue), intermediate frames reflecting the trajectory (red), and the final novel view (green). view at source ↗
Figure 6
Figure 6. Qualitative comparison on the MannequinChallenge dataset. Under camera motion, our method achieves more stable identity preservation compared with other methods, maintaining more consistent appearance. view at source ↗
Figure 7
Figure 7. Qualitative results of the ablation study. Without point cloud or intermediate supervision, the generated results suffer from object duplication, incorrect placement, and increased blur, leading to degraded geometric consistency. (Panels: Input, Ours, GT.) view at source ↗
Figure 8
Figure 8. Failure cases. Left: complex objects challenge geometry and texture preservation; right: extreme camera changes impede geometric consistency. view at source ↗
read the original abstract

Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes UniGeo, a camera-controllable image editing framework that leverages video models to address geometric drift under continuous camera motion. It unifies geometric guidance at three levels: representation (via a frame-decoupled geometric reference injection mechanism), architecture (via geometric anchor attention), and loss function (via a trajectory-endpoint geometric supervision strategy). The paper claims this yields superior visual quality and geometric consistency compared to prior methods based on fragmented guidance and image diffusion models, supported by comprehensive experiments on public benchmarks covering extensive and limited camera motion settings.

Significance. If the empirical results hold, the work could advance camera-controllable editing by demonstrating how video priors can be stabilized through explicit multi-level geometric unification rather than relying on fragmented cues. The three concrete mechanisms (frame-decoupled reference injection, geometric anchor attention, and trajectory-endpoint supervision) are specific, potentially reusable contributions, and the authors deserve credit for targeting the multi-level structure of generative models.

major comments (2)
  1. [Abstract] The claim that 'UniGeo significantly outperforms existing methods in both visual quality and geometric consistency' is presented without any quantitative metrics, baseline details, error bars, or ablation results, leaving the central performance claim without visible supporting evidence.
  2. [Experiments] No ablation isolates the contribution of a video-model baseline using only representation-level injection against the full three-level UniGeo on the reported geometric consistency metrics. This is load-bearing for the claim that fragmented guidance remains the dominant failure mode and that joint injection at all three levels is required to avoid drift or new inconsistencies.
minor comments (1)
  1. [Abstract] The distinction between 'extensive and limited camera motion settings' is referenced but never defined with specific thresholds or examples; defining it would aid reader understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed review. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'UniGeo significantly outperforms existing methods in both visual quality and geometric consistency' is presented without any quantitative metrics, baseline details, error bars, or ablation results, leaving the central performance claim without visible supporting evidence.

    Authors: We agree that the abstract would benefit from brief supporting evidence to contextualize the performance claim. In the revised manuscript, we have updated the abstract to include key quantitative metrics (e.g., specific improvements in PSNR, SSIM, and geometric consistency scores) and a concise reference to the main baselines and experimental settings. Detailed tables with error bars, full ablations, and per-scenario breakdowns remain in the Experiments section, as they exceed the length constraints of an abstract while preserving its summary nature. revision: yes

  2. Referee: [Experiments] No ablation isolates the contribution of a video-model baseline using only representation-level injection against the full three-level UniGeo on the reported geometric consistency metrics. This is load-bearing for the claim that fragmented guidance remains the dominant failure mode and that joint injection at all three levels is required to avoid drift or new inconsistencies.

    Authors: This is a fair and substantive point. Our experiments compare UniGeo against prior fragmented-guidance methods (both image- and video-based) and include component-wise ablations, but we did not explicitly report a video-model baseline limited to representation-level injection evaluated on the geometric consistency metrics. To directly address the concern and reinforce the necessity of multi-level unification, we will add this ablation in the revised Experiments section, including quantitative results on the relevant metrics to show that representation-level injection alone is insufficient to prevent drift under continuous camera motion. revision: yes

Circularity Check

0 steps flagged

No circularity: new mechanisms validated on external benchmarks

full rationale

The paper proposes three distinct new components (frame-decoupled geometric reference injection, geometric anchor attention, and trajectory-endpoint supervision) to unify guidance across representation, architecture, and loss levels in a video model. These are introduced as original contributions rather than derived from or equivalent to prior inputs. Performance is assessed via comparisons on public benchmarks under varied camera motion settings, with no equations or fitted parameters renamed as predictions, and no self-citation chains that reduce the central claims to the paper's own definitions or data subsets. The derivation chain remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The claim rests on the domain assumption that video models supply useful continuous viewpoint priors and on three newly introduced mechanisms whose effectiveness is asserted without external independent validation.

axioms (1)
  • domain assumption Video models provide continuous viewpoint priors that can be leveraged for camera-controllable editing.
    Stated as an observation in the abstract that motivates the approach.
invented entities (3)
  • frame-decoupled geometric reference injection mechanism no independent evidence
    purpose: Provide robust cross-view geometry context at the representation level
    Newly proposed component without cited external evidence of prior use.
  • geometric anchor attention no independent evidence
    purpose: Align multi-view features at the architecture level
    Newly proposed attention module.
  • trajectory-endpoint geometric supervision strategy no independent evidence
    purpose: Reinforce structural fidelity of target views at the loss level
    New supervision strategy introduced in the paper.

pith-pipeline@v0.9.0 · 5548 in / 1380 out tokens · 97656 ms · 2026-05-12T02:48:20.207558+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 21 internal anchors

  1. [1]

    Building Normalizing Flows with Stochastic Interpolants

    Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571 (2022)

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062 (2025)

  3. [3]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)

    Avrahami, O., Patashnik, O., Fried, O., Nemchinov, E., Aberman, K., Lischinski, D., Cohen-Or, D.: Stable flow: Vital layers for training-free image editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 7877–7888 (June 2025)

  4. [4]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22875–22889 (2025)

  5. [5]

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

    Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., et al.: Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781 (2024)

  6. [6]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

  7. [7]

    arXiv preprint arXiv:2510.20385 (2025)

    Bai, Y., Li, H., Huang, Q.: Positional encoding field. arXiv preprint arXiv:2510.20385 (2025)

  8. [8]

    In: European conference on computer vision

    Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2live: Text- driven layered image and video editing. In: European conference on computer vision. pp. 707–723. Springer (2022)

  9. [9]

    IEEE transactions on pattern analysis and machine intelligence35(8), 1798–1828 (2013)

    Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence35(8), 1798–1828 (2013)

  10. [10]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  11. [11]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)

  12. [12]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Brack, M., Friedrich, F., Kornmeier, K., Tsaban, L., Schramowski, P., Kersting, K., Passos, A.: Ledits++: Limitless image editing using text-to-image models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8861–8870 (2024)

  13. [13]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023)

  14. [14]

Video Generation Models as World Simulators

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog 1(8), 1 (2024)

  15. [15]

    arXiv preprint arXiv:2601.18993 (2026)

    Cao, W., Zhang, H., Tian, F., Wu, Y., Li, Y., Wang, S., Yu, N., Liu, Y.: Freeorbit4d: Training-free arbitrary camera redirection for monocular videos via geometry- complete 4d reconstruction. arXiv preprint arXiv:2601.18993 (2026)

  16. [16]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Chan, E.R., Nagano, K., Chan, M.A., Bergman, A.W., Park, J.J., Levy, A., Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative novel view synthesis with 3d- aware diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4217–4229 (2023)

  17. [17]

Blip3o-Next: Next Frontier of Native Image Generation

    Chen, J., Xue, L., Xu, Z., Pan, X., Yang, S., Qin, C., Yan, A., Zhou, H., Chen, Z., Huang, L., et al.: Blip3o-next: Next frontier of native image generation. arXiv preprint arXiv:2510.15857 (2025)

  18. [18]

    arXiv preprint arXiv:2503.13265 (2025)

    Chen, L., Zhou, Z., Zhao, M., Wang, Y., Zhang, G., Huang, W., Sun, H., Wen, J.R., Li, C.: Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis. arXiv preprint arXiv:2503.13265 (2025)

  19. [19]

Contributors, L.: Lightx2v: Light video generation inference framework. https://github.com/ModelTC/lightx2v (2025)

  20. [20]

    arXiv preprint arXiv:2210.11427 (2022)

    Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427 (2022)

  21. [21]

    Cui, Y., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y., Luo, Z., Wang, J., Wang, W., Wang, Y., Wang, C., Zhang, F., Zhao, Y., Pan, T., Li, X., Hao, Z., Ma, W., Chen, Z., Ao, Y., Huang, T., Wang, Z., Wang, X.: Emu3.5: Native multimodal models are world learners (2025),https://arxiv.org/abs/2510.26583

  22. [22]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  23. [23]

    Communications of the ACM55(10), 78–87 (2012)

    Domingos, P.: A few useful things to know about machine learning. Communications of the ACM55(10), 78–87 (2012)

  24. [24]

Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

    Dong, H., Wang, W., Li, C., Lin, D.: Wan-alpha: High-quality text-to-video generation with alpha channel. arXiv e-prints, arXiv–2509 (2025)

  25. [25]

    In: Forty-first international conference on machine learning (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

  26. [26]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314 (2024)

  27. [27]

Deep Learning

    Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016), http://www.deeplearningbook.org

  28. [28]

    In: European Conference on Computer Vision

    Guo, Y., Yang, C., Rao, A., Agrawala, M., Lin, D., Dai, B.: Sparsectrl: Adding sparse controls to text-to-video diffusion models. In: European Conference on Computer Vision. pp. 330–348. Springer (2024)

  29. [29]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  30. [30]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: En- abling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

  31. [31]

    Prompt-to-Prompt Image Editing with Cross Attention Control

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

  32. [32]

    Advances in neural information processing systems30(2017)

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  33. [33]

Training-Free Camera Control for Video Generation

    Hou, C., Chen, Z.: Training-free camera control for video generation. arXiv preprint arXiv:2406.10126 (2024)

  34. [34]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

    Huang, Y., Huang, J., Liu, Y., Yan, M., Lv, J., Liu, J., Xiong, W., Zhang, H., Cao, L., Chen, S.: Diffusion model-based image editing: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  35. [35]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, Y., Xie, L., Wang, X., Yuan, Z., Cun, X., Ge, Y., Zhou, J., Dong, C., Huang, R., Zhang, R., et al.: Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8362–8371 (2024)

  36. [36]

    In: The Twelfth International Conference on Learning Representations (2023)

    Ju, X., Zeng, A., Bian, Y., Liu, S., Xu, Q.: Pnp inversion: Boosting diffusion-based editing with 3 lines of code. In: The Twelfth International Conference on Learning Representations (2023)

  37. [37]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., Irani, M.: Imagic: Text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6007–6017 (2023)

  38. [38]

    ACM Transactions on Graphics (ToG)36(4), 1–13 (2017)

    Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG)36(4), 1–13 (2017)

  39. [39]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  40. [40]

    Advances in Neural Information Processing Systems37, 16240–16271 (2024)

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems37, 16240–16271 (2024)

  41. [41]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Kulikov, V., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion- free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)

  42. [42]

Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)

  43. [43]

    nature521(7553), 436–444 (2015)

    LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. nature521(7553), 436–444 (2015)

  44. [44]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4521–4530 (2019)

  45. [45]

UniWorld-V2: Reinforce Image Editing with Diffusion Negative-Aware Finetuning and MLLM Implicit Feedback

    Li, Z., Liu, Z., Zhang, Q., Lin, B., Wu, F., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., et al.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888 (2025)

  46. [46]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

  47. [47]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22160–22169 (2024)

  48. [48]

    Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  49. [49]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9298–9309 (2023)

  50. [50]

    Step1X-Edit: A Practical Framework for General Image Editing

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)

  51. [51]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  52. [52]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Ma, B., Gao, H., Deng, H., Luo, Z., Huang, T., Tang, L., Wang, X.: You see it, you got it: Learning 3d creation on pose-free videos at scale. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2016–2029 (2025)

  53. [53]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)

  54. [54]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D.: Null-text inver- sion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6038–6047 (2023)

  55. [55]

    In: ACM SIGGRAPH 2023 conference proceedings

    Parmar, G., Kumar Singh, K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot image-to-image translation. In: ACM SIGGRAPH 2023 conference proceedings. pp. 1–11 (2023)

  56. [56]

    Open-sora 2.0: Training a commercial-level video generation model in $200k

    Peng, X., Zheng, Z., Shen, C., Young, T., Guo, X., Wang, B., Xu, H., Liu, H., Jiang, M., Li, W., Wang, Y., Ye, A., Ren, G., Ma, Q., Liang, W., Lian, X., Wu, X., Zhong, Y., Li, Z., Gong, C., Lei, G., Cheng, L., Zhang, L., Li, M., Zhang, R., Hu, S., Huang, S., Wang, X., Zhao, Y., Wang, Y., Wei, Z., You, Y.: Open-sora 2.0: Training a commercial-level video g...

  57. [57]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6121–6132 (2025)

  58. [58]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Rotstein, N., Yona, G., Silver, D., Velich, R., Bensaïd, D., Kimmel, R.: Pathways on the image manifold: Image editing via video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7857–7866 (2025)

  59. [59]

    arXiv preprint arXiv:2410.10792 (2024)

    Rout, L., Chen, Y., Ruiz, N., Caramanis, C., Shakkottai, S., Chu, W.S.: Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792 (2024)

  60. [60]

    Advances in Neural Information Processing Systems37, 80220– 80243 (2024)

    Seo, J., Fukuda, K., Shibuya, T., Narihira, T., Murata, N., Hu, S., Lai, C.H., Kim, S., Mitsufuji, Y.: Genwarp: Single image to novel views with semantic-preserving generative warping. Advances in Neural Information Processing Systems37, 80220– 80243 (2024)

  61. [61]

Zero123++: A Single Image to Consistent Multi-View Diffusion Base Model

    Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.: Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110 (2023)

  62. [62]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  63. [63]

Insert Anything: Image Insertion via In-Context Editing in DiT

    Song, W., Jiang, H., Yang, Z., Quan, R., Yang, Y.: Insert anything: Image insertion via in-context editing in dit. arXiv preprint arXiv:2504.15009 (2025)

  64. [64]

    arXiv preprint arXiv:2411.04928 (2024)

    Sun, W., Chen, S., Liu, F., Chen, Z., Duan, Y., Zhang, J., Wang, Y.: Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928 (2024)

  65. [65]

    In: European Conference on Computer Vision

    Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative camera dolly: Extreme monocular dynamic novel view synthesis. In: European Conference on Computer Vision. pp. 313–331. Springer (2024)

  66. [67]

    Wan: Open and Advanced Large-Scale Video Generative Models

Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  67. [68]

    Ovis-u1 technical report

    Wang, G.H., Zhao, S., Zhang, X., Cao, L., Zhan, P., Duan, L., Lu, S., Fu, M., Zhao, J., Li, Y., Chen, Q.G.: Ovis-u1 technical report. arXiv preprint arXiv:2506.23044 (2025)

  68. [69]

    arXiv preprint arXiv:2411.04746 (2024)

    Wang, J., Pu, J., Qi, Z., Guo, J., Ma, Y., Huang, N., Chen, Y., Li, X., Shan, Y.: Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746 (2024)

  69. [70]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  70. [71]

    IEEE transactions on image processing 13(4), 600–612 (2004)

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

  71. [72]

    In: ACM SIGGRAPH 2024 Conference Papers

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  72. [73]

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

  73. [74]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)

  74. [75]

ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

    Wu, J.Z., Ren, X., Shen, T., Cao, T., He, K., Lu, Y., Gao, R., Xie, E., Lan, S., Alvarez, J.M., et al.: Chronoedit: Towards temporal reasoning for image editing and world simulation. arXiv preprint arXiv:2510.04290 (2025)

  75. [76]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wu, P., Zhu, K., Liu, Y., Zhao, L., Zhai, W., Cao, Y., Zha, Z.J.: Improved video vae for latent video diffusion model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18124–18133 (2025)

  76. [77]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wu, R., Mildenhall, B., Henzler, P., Park, K., Gao, R., Watson, D., Srinivasan, P.P., Verbin, D., Barron, J.T., Poole, B., et al.: Reconfusion: 3d reconstruction with diffusion priors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21551–21561 (2024)

  77. [78]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)

  78. [79]

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    Xu, D., Nie, W., Liu, C., Liu, S., Kautz, J., Wang, Z., Vahdat, A.: Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509 (2024)

  79. [80]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xu, S., Huang, Y., Pan, J., Ma, Z., Chai, J.: Inversion-free image editing with language-guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9452–9461 (2024)

  80. [81]

    In: ACM SIGGRAPH 2024 Conference Papers

    Yang, S., Hou, L., Huang, H., Ma, C., Wan, P., Zhang, D., Chen, X., Liao, J.: Direct-a-video: Customized video generation with user-directed camera movement and object motion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–12 (2024)

Showing first 80 references.