pith. machine review for the scientific record.

arxiv: 2604.17565 · v2 · submitted 2026-04-19 · 💻 cs.CV

Recognition: no theorem link

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

Hong Jiang, Ruijie Quan, Wensong Song, Yi Yang, Zongxing Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera-controllable image editing · geometric consistency · video models · novel view synthesis · geometric guidance · multi-view alignment · diffusion models · image editing

The pith

Unifying geometric guidance at representation, architecture, and loss levels lets video models edit images under new camera poses with less drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Camera-controllable image editing requires synthesizing new views of a scene while preserving strict geometric consistency across those views. Existing methods rely on fragmented guidance and image-based models that operate on discrete mappings, which produces drift and degradation especially during continuous camera motion. The paper argues that video models supply useful continuous viewpoint priors but still need unified geometric guidance injected at the three levels that shape generative output. By adding a frame-decoupled reference mechanism, anchor attention for feature alignment, and endpoint supervision for structural fidelity, the approach claims to stabilize results. If the claim holds, novel-view editing becomes more reliable for tasks that demand consistent scene structure under varying viewpoints.
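A minimal sketch of the geometry-construction step behind the frame-decoupled reference, assuming a depth map and camera intrinsics are available: the input image is lifted into a camera-space point cloud that could then be re-rendered as a per-frame geometric reference along the trajectory. The function name, tensor shapes, and depth source are illustrative assumptions, not the paper's implementation.

    import torch

    def lift_to_pointcloud(image, depth, K):
        # image: (3, H, W) RGB tensor; depth: (H, W) depth map; K: (3, 3) camera intrinsics.
        _, h, w = image.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pixels = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()  # homogeneous pixel coords, (H, W, 3)
        rays = pixels @ torch.linalg.inv(K).T        # back-project through the inverse intrinsics
        points = rays * depth.unsqueeze(-1)          # scale each ray by its depth -> camera-space XYZ
        colors = image.permute(1, 2, 0)              # per-point RGB, (H, W, 3)
        return points.reshape(-1, 3), colors.reshape(-1, 3)

The returned points and colors are what a renderer would project into each frame of the camera trajectory; how UniGeo actually encodes and injects them along the frame dimension is described only at the level of the Figure 2 caption.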

Core claim

The paper claims that fragmented geometric guidance is the root cause of instability in video-model-based camera-controllable editing and that injecting unified guidance at representation, architecture, and loss levels jointly resolves it. At the representation level a frame-decoupled geometric reference injection supplies cross-view context. At the architecture level geometric anchor attention aligns multi-view features. At the loss level a trajectory-endpoint supervision strategy explicitly reinforces structural fidelity of target views. Experiments across public benchmarks with both extensive and limited camera motion show the resulting outputs exceed prior methods in visual quality and geometric consistency.
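As a rough illustration of the architecture-level mechanism, the sketch below implements a single-head cross-attention in which every frame's tokens query only the first frame's tokens, which serve as geometric anchors (matching the description of geometric anchor attention in the Figure 2 caption). The projection matrices, shapes, and single-head layout are assumptions; the paper's module may differ.

    import torch

    def anchor_attention(frame_tokens, w_q, w_k, w_v):
        # frame_tokens: (batch, frames, tokens, dim); w_q, w_k, w_v: (dim, d_head) projections.
        d_head = w_q.shape[-1]
        anchors = frame_tokens[:, :1]                    # first-frame tokens act as the anchors
        q = frame_tokens @ w_q                           # queries come from every frame
        k = anchors @ w_k                                # keys come from the anchor frame only
        v = anchors @ w_v                                # values come from the anchor frame only
        scores = q @ k.transpose(-2, -1) / d_head ** 0.5
        attn = torch.softmax(scores, dim=-1)             # (batch, frames, tokens, tokens), broadcast over frames
        return attn @ v                                  # anchor-aligned features for every frame

Because the keys and values always come from the same anchor frame, each view's features are pulled toward one shared geometric reference rather than toward whichever neighboring frame happens to be closest.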

What carries the argument

The three-level unified geometric guidance system that combines frame-decoupled reference injection for context, geometric anchor attention for feature alignment, and trajectory-endpoint supervision for fidelity.
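A hedged sketch of the loss-level idea, assuming a simple per-frame reconstruction objective: the final (target-view) frame is weighted more heavily than intermediate frames, in line with the trajectory-endpoint supervision described in the Figure 2 caption. The mean-squared-error form and the weight value are illustrative choices, not numbers from the paper.

    import torch

    def endpoint_weighted_loss(pred, target, endpoint_weight=2.0):
        # pred, target: (batch, frames, channels, height, width) video tensors.
        num_frames = pred.shape[1]
        weights = torch.ones(num_frames, device=pred.device)
        weights[-1] = endpoint_weight                              # up-weight the final, target-view frame
        weights = weights / weights.sum()
        per_frame = ((pred - target) ** 2).mean(dim=(0, 2, 3, 4))  # mean-squared error per frame
        return (weights * per_frame).sum()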

If this is right

  • The unified approach outperforms existing methods on public benchmarks for both large and small camera motions.
  • Geometric drift and structural degradation are reduced under continuous camera movement.
  • Cross-view consistency is maintained more reliably because guidance acts at every level that shapes the output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same multi-level unification pattern could be tested on other tasks that require multi-view consistency, such as video prediction or light-field rendering.
  • Extending the supervision to longer sequences would check whether the stability scales to extended camera paths not covered in current benchmarks.
  • Pairing the framework with real-time pose estimation could enable interactive editing sessions where users freely move the virtual camera.

Load-bearing premise

That fragmented guidance is the main driver of drift and that adding unified injections at precisely these three levels will stabilize output without creating fresh inconsistencies or demanding heavy retuning.

What would settle it

A controlled test on long or rapid camera trajectories: if the three-level guidance still produces measurable geometric drift or structural degradation comparable to earlier methods, the core claim fails.

Figures

Figures reproduced from arXiv: 2604.17565 by Hong Jiang, Ruijie Quan, Wensong Song, Yi Yang, Zongxing Yang.

Figure 1
Figure 1. Visual comparisons. Existing methods relying on fragmented geometric guidance often suffer from structural distortions or artifacts under camera motion (highlighted in red). In contrast, by enforcing unified geometric guidance, our UniGeo successfully preserves global scene geometry and structural fidelity (highlighted in green, with selected details enlarged). view at source ↗
Figure 2
Figure 2. UniGeo Framework. UniGeo incorporates unified geometric guidance through: (a) Geometry Construction: Lifting input images into 3D point cloud sequences. (b) Frame-Decoupled Geometry Injection: Injecting sequences along the frame dimension. (c) Geometric Anchor Attention: Aligning cross-view features using first-frame tokens as anchors. (d) Trajectory-Endpoint Geometric Supervision: Applying higher loss wei… view at source ↗
Figure 3
Figure 3. Qualitative comparison under the extensive camera motion setting. Compared with other methods, our approach better preserves the geometric structure of the scene under extensive camera motion, effectively avoiding structural duplication. view at source ↗
Figure 4
Figure 4. Qualitative comparison under the limited camera motion setting. Our method maintains stable spatial layouts and scene structural consistency across views, while better preserving fine-grained scene details. (Panels: Input, Intermediate View, Result.) view at source ↗
Figure 5
Figure 5. Our approach models continuous camera motion characteristics. Sequences are shown from left to right: the input image (blue), intermediate frames reflecting the trajectory (red), and the final novel view (green). view at source ↗
Figure 6
Figure 6. Qualitative comparison on the MannequinChallenge dataset. Under camera motion, our method achieves more stable identity preservation compared with other methods, maintaining more consistent appearance. view at source ↗
Figure 7
Figure 7. Qualitative results of the ablation study. Without point cloud or intermediate supervision, the generated results suffer from object duplication, incorrect placement, and increased blur, leading to degraded geometric consistency. (Panels: Input, Ours, GT.) view at source ↗
Figure 8
Figure 8. Failure cases. Left: complex objects challenge geometry and texture preservation; right: extreme camera changes impede geometric consistency. view at source ↗
read the original abstract

Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes UniGeo, a camera-controllable image editing framework that leverages video models to address geometric drift under continuous camera motion. It unifies geometric guidance at three levels: representation (via a frame-decoupled geometric reference injection mechanism), architecture (via geometric anchor attention), and loss function (via a trajectory-endpoint geometric supervision strategy). The paper claims this yields superior visual quality and geometric consistency compared to prior methods based on fragmented guidance and image diffusion models, supported by comprehensive experiments on public benchmarks covering extensive and limited camera motion settings.

Significance. If the empirical results hold, the work could advance camera-controllable editing by demonstrating how video priors can be stabilized through explicit multi-level geometric unification rather than relying on fragmented cues. The three concrete mechanisms (frame-decoupled reference injection, geometric anchor attention, and trajectory-endpoint supervision) are specific, potentially reusable contributions, and the authors deserve credit for targeting the multi-level structure of generative models.

major comments (2)
  1. [Abstract] The claim that 'UniGeo significantly outperforms existing methods in both visual quality and geometric consistency' is presented without any quantitative metrics, baseline details, error bars, or ablation results, leaving the central performance claim without visible supporting evidence.
  2. [Experiments] No ablation isolates the contribution of a video-model baseline using only representation-level injection against the full three-level UniGeo on the reported geometric consistency metrics. This is load-bearing for the claim that fragmented guidance remains the dominant failure mode and that joint injection at all three levels is required to avoid drift or new inconsistencies.
minor comments (1)
  1. [Abstract] The distinction between 'extensive and limited camera motion settings' is referenced but never defined with specific thresholds or examples; defining it would aid reader understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed review. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'UniGeo significantly outperforms existing methods in both visual quality and geometric consistency' is presented without any quantitative metrics, baseline details, error bars, or ablation results, leaving the central performance claim without visible supporting evidence.

    Authors: We agree that the abstract would benefit from brief supporting evidence to contextualize the performance claim. In the revised manuscript, we have updated the abstract to include key quantitative metrics (e.g., specific improvements in PSNR, SSIM, and geometric consistency scores) and a concise reference to the main baselines and experimental settings. Detailed tables with error bars, full ablations, and per-scenario breakdowns remain in the Experiments section, as they exceed the length constraints of an abstract while preserving its summary nature. revision: yes

  2. Referee: [Experiments] No ablation isolates the contribution of a video-model baseline using only representation-level injection against the full three-level UniGeo on the reported geometric consistency metrics. This is load-bearing for the claim that fragmented guidance remains the dominant failure mode and that joint injection at all three levels is required to avoid drift or new inconsistencies.

    Authors: This is a fair and substantive point. Our experiments compare UniGeo against prior fragmented-guidance methods (both image- and video-based) and include component-wise ablations, but we did not explicitly report a video-model baseline limited to representation-level injection evaluated on the geometric consistency metrics. To directly address the concern and reinforce the necessity of multi-level unification, we will add this ablation in the revised Experiments section, including quantitative results on the relevant metrics to show that representation-level injection alone is insufficient to prevent drift under continuous camera motion. revision: yes

Circularity Check

0 steps flagged

No circularity: new mechanisms validated on external benchmarks

full rationale

The paper proposes three distinct new components (frame-decoupled geometric reference injection, geometric anchor attention, and trajectory-endpoint supervision) to unify guidance across representation, architecture, and loss levels in a video model. These are introduced as original contributions rather than derived from or equivalent to prior inputs. Performance is assessed via comparisons on public benchmarks under varied camera motion settings, with no equations or fitted parameters renamed as predictions, and no self-citation chains that reduce the central claims to the paper's own definitions or data subsets. The derivation chain remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The claim rests on the domain assumption that video models supply useful continuous viewpoint priors and on three newly introduced mechanisms whose effectiveness is asserted without external independent validation.

axioms (1)
  • domain assumption Video models provide continuous viewpoint priors that can be leveraged for camera-controllable editing.
    Stated as an observation in the abstract that motivates the approach.
invented entities (3)
  • frame-decoupled geometric reference injection mechanism no independent evidence
    purpose: Provide robust cross-view geometry context at the representation level
    Newly proposed component without cited external evidence of prior use.
  • geometric anchor attention no independent evidence
    purpose: Align multi-view features at the architecture level
    Newly proposed attention module.
  • trajectory-endpoint geometric supervision strategy no independent evidence
    purpose: Reinforce structural fidelity of target views at the loss level
    New supervision strategy introduced in the paper.

pith-pipeline@v0.9.0 · 5548 in / 1380 out tokens · 97656 ms · 2026-05-12T02:48:20.207558+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 21 internal anchors

  1. [1]

    Building Normalizing Flows with Stochastic Interpolants

    Albergo, M.S., Vanden-Eijnden, E.: Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571 (2022)

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062 (2025)

  3. [3]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)

    Avrahami, O., Patashnik, O., Fried, O., Nemchinov, E., Aberman, K., Lischinski, D., Cohen-Or, D.: Stable flow: Vital layers for training-free image editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 7877–7888 (June 2025)

  4. [4]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22875–22889 (2025)

  5. [5]

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

    Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., et al.: Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781 (2024)

  6. [6]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

  7. [7]

    arXiv preprint arXiv:2510.20385 (2025)

    Bai, Y., Li, H., Huang, Q.: Positional encoding field. arXiv preprint arXiv:2510.20385 (2025)

  8. [8]

    In: European conference on computer vision

    Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2live: Text- driven layered image and video editing. In: European conference on computer vision. pp. 707–723. Springer (2022)

  9. [9]

    IEEE transactions on pattern analysis and machine intelligence35(8), 1798–1828 (2013)

    Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence35(8), 1798–1828 (2013)

  10. [10]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  11. [11]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)

  12. [12]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Brack, M., Friedrich, F., Kornmeier, K., Tsaban, L., Schramowski, P., Kersting, K., Passos, A.: Ledits++: Limitless image editing using text-to-image models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8861–8870 (2024)

  13. [13]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023)

  14. [14]

Video Generation Models as World Simulators

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog 1(8), 1 (2024)

  15. [15]

    arXiv preprint arXiv:2601.18993 (2026)

    Cao, W., Zhang, H., Tian, F., Wu, Y., Li, Y., Wang, S., Yu, N., Liu, Y.: Freeorbit4d: Training-free arbitrary camera redirection for monocular videos via geometry- complete 4d reconstruction. arXiv preprint arXiv:2601.18993 (2026)

  16. [16]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Chan, E.R., Nagano, K., Chan, M.A., Bergman, A.W., Park, J.J., Levy, A., Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative novel view synthesis with 3d- aware diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4217–4229 (2023)

  17. [17]

Blip3o-Next: Next Frontier of Native Image Generation

    Chen, J., Xue, L., Xu, Z., Pan, X., Yang, S., Qin, C., Yan, A., Zhou, H., Chen, Z., Huang, L., et al.: Blip3o-next: Next frontier of native image generation. arXiv preprint arXiv:2510.15857 (2025)

  18. [18]

    arXiv preprint arXiv:2503.13265 (2025)

    Chen, L., Zhou, Z., Zhao, M., Wang, Y., Zhang, G., Huang, W., Sun, H., Wen, J.R., Li, C.: Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis. arXiv preprint arXiv:2503.13265 (2025)

  19. [19]

Contributors, L.: Lightx2v: Light video generation inference framework. https://github.com/ModelTC/lightx2v (2025)

  20. [20]

    arXiv preprint arXiv:2210.11427 (2022)

    Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427 (2022)

  21. [21]

    Cui, Y., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y., Luo, Z., Wang, J., Wang, W., Wang, Y., Wang, C., Zhang, F., Zhao, Y., Pan, T., Li, X., Hao, Z., Ma, W., Chen, Z., Ao, Y., Huang, T., Wang, Z., Wang, X.: Emu3.5: Native multimodal models are world learners (2025),https://arxiv.org/abs/2510.26583

  22. [22]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., Shi, G., Fan, H.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  23. [23]

    Communications of the ACM55(10), 78–87 (2012)

    Domingos, P.: A few useful things to know about machine learning. Communications of the ACM55(10), 78–87 (2012)

  24. [24]

Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

    Dong, H., Wang, W., Li, C., Lin, D.: Wan-alpha: High-quality text-to-video generation with alpha channel. arXiv e-prints, arXiv–2509 (2025)

  25. [25]

    In: Forty-first international conference on machine learning (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

  26. [26]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314 (2024)

  27. [27]

Deep Learning

    Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016), http://www.deeplearningbook.org

  28. [28]

    In: European Conference on Computer Vision

    Guo, Y., Yang, C., Rao, A., Agrawala, M., Lin, D., Dai, B.: Sparsectrl: Adding sparse controls to text-to-video diffusion models. In: European Conference on Computer Vision. pp. 330–348. Springer (2024)

  29. [29]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  30. [30]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: En- abling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

  31. [31]

    Prompt-to-Prompt Image Editing with Cross Attention Control

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

  32. [32]

    Advances in neural information processing systems30(2017)

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  33. [33]

Training-Free Camera Control for Video Generation

    Hou, C., Chen, Z.: Training-free camera control for video generation. arXiv preprint arXiv:2406.10126 (2024)

  34. [34]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

    Huang, Y., Huang, J., Liu, Y., Yan, M., Lv, J., Liu, J., Xiong, W., Zhang, H., Cao, L., Chen, S.: Diffusion model-based image editing: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  35. [35]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, Y., Xie, L., Wang, X., Yuan, Z., Cun, X., Ge, Y., Zhou, J., Dong, C., Huang, R., Zhang, R., et al.: Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8362–8371 (2024)

  36. [36]

    In: The Twelfth International Conference on Learning Representations (2023)

    Ju, X., Zeng, A., Bian, Y., Liu, S., Xu, Q.: Pnp inversion: Boosting diffusion-based editing with 3 lines of code. In: The Twelfth International Conference on Learning Representations (2023)

  37. [37]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., Irani, M.: Imagic: Text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6007–6017 (2023)

  38. [38]

    ACM Transactions on Graphics (ToG)36(4), 1–13 (2017)

    Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG)36(4), 1–13 (2017)

  39. [39]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  40. [40]

    Advances in Neural Information Processing Systems37, 16240–16271 (2024)

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems37, 16240–16271 (2024)

  41. [41]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Kulikov, V., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion- free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)

  42. [42]

Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)

  43. [43]

    nature521(7553), 436–444 (2015)

    LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. nature521(7553), 436–444 (2015)

  44. [44]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4521–4530 (2019)

  45. [45]

UniWorld-V2: Reinforce Image Editing with Diffusion Negative-Aware Finetuning and MLLM Implicit Feedback

    Li, Z., Liu, Z., Zhang, Q., Lin, B., Wu, F., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., et al.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888 (2025)

  46. [46]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

  47. [47]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22160–22169 (2024)

  48. [48]

    Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  49. [49]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9298–9309 (2023)

  50. [50]

    Step1X-Edit: A Practical Framework for General Image Editing

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)

  51. [51]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  52. [52]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Ma, B., Gao, H., Deng, H., Luo, Z., Huang, T., Tang, L., Wang, X.: You see it, you got it: Learning 3d creation on pose-free videos at scale. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2016–2029 (2025)

  53. [53]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)

  54. [54]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D.: Null-text inver- sion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6038–6047 (2023)

  55. [55]

    In: ACM SIGGRAPH 2023 conference proceedings

    Parmar, G., Kumar Singh, K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot image-to-image translation. In: ACM SIGGRAPH 2023 conference proceedings. pp. 1–11 (2023)

  56. [56]

    Open-sora 2.0: Training a commercial-level video generation model in $200k

    Peng, X., Zheng, Z., Shen, C., Young, T., Guo, X., Wang, B., Xu, H., Liu, H., Jiang, M., Li, W., Wang, Y., Ye, A., Ren, G., Ma, Q., Liang, W., Lian, X., Wu, X., Zhong, Y., Li, Z., Gong, C., Lei, G., Cheng, L., Zhang, L., Li, M., Zhang, R., Hu, S., Huang, S., Wang, X., Zhao, Y., Wang, Y., Wei, Z., You, Y.: Open-sora 2.0: Training a commercial-level video g...

  57. [57]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6121–6132 (2025)

  58. [58]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Rotstein, N., Yona, G., Silver, D., Velich, R., Bensaïd, D., Kimmel, R.: Pathways on the image manifold: Image editing via video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7857–7866 (2025)

  59. [59]

    arXiv preprint arXiv:2410.10792 (2024)

    Rout, L., Chen, Y., Ruiz, N., Caramanis, C., Shakkottai, S., Chu, W.S.: Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792 (2024)

  60. [60]

    Advances in Neural Information Processing Systems37, 80220– 80243 (2024)

    Seo, J., Fukuda, K., Shibuya, T., Narihira, T., Murata, N., Hu, S., Lai, C.H., Kim, S., Mitsufuji, Y.: Genwarp: Single image to novel views with semantic-preserving generative warping. Advances in Neural Information Processing Systems37, 80220– 80243 (2024)

  61. [61]

Zero123++: A Single Image to Consistent Multi-View Diffusion Base Model

    Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.: Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110 (2023)

  62. [62]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  63. [63]

Insert Anything: Image Insertion via In-Context Editing in DiT

    Song, W., Jiang, H., Yang, Z., Quan, R., Yang, Y.: Insert anything: Image insertion via in-context editing in dit. arXiv preprint arXiv:2504.15009 (2025)

  64. [64]

    arXiv preprint arXiv:2411.04928 (2024)

    Sun, W., Chen, S., Liu, F., Chen, Z., Duan, Y., Zhang, J., Wang, Y.: Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928 (2024)

  65. [65]

    In: European Conference on Computer Vision

    Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative camera dolly: Extreme monocular dynamic novel view synthesis. In: European Conference on Computer Vision. pp. 313–331. Springer (2024)

  66. [67]

    Wan: Open and Advanced Large-Scale Video Generative Models

Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  67. [68]

    Ovis-u1 technical report

    Wang, G.H., Zhao, S., Zhang, X., Cao, L., Zhan, P., Duan, L., Lu, S., Fu, M., Zhao, J., Li, Y., Chen, Q.G.: Ovis-u1 technical report. arXiv preprint arXiv:2506.23044 (2025)

  68. [69]

    arXiv preprint arXiv:2411.04746 (2024)

    Wang, J., Pu, J., Qi, Z., Guo, J., Ma, Y., Huang, N., Chen, Y., Li, X., Shan, Y.: Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746 (2024)

  69. [70]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  70. [71]

    IEEE transactions on image processing 13(4), 600–612 (2004)

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

  71. [72]

    In: ACM SIGGRAPH 2024 Conference Papers

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  72. [73]

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

  73. [74]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)

  74. [75]

ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

    Wu, J.Z., Ren, X., Shen, T., Cao, T., He, K., Lu, Y., Gao, R., Xie, E., Lan, S., Alvarez, J.M., et al.: Chronoedit: Towards temporal reasoning for image editing and world simulation. arXiv preprint arXiv:2510.04290 (2025)

  75. [76]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wu, P., Zhu, K., Liu, Y., Zhao, L., Zhai, W., Cao, Y., Zha, Z.J.: Improved video vae for latent video diffusion model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18124–18133 (2025)

  76. [77]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wu, R., Mildenhall, B., Henzler, P., Park, K., Gao, R., Watson, D., Srinivasan, P.P., Verbin, D., Barron, J.T., Poole, B., et al.: Reconfusion: 3d reconstruction with diffusion priors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21551–21561 (2024)

  77. [78]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)

  78. [79]

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    Xu, D., Nie, W., Liu, C., Liu, S., Kautz, J., Wang, Z., Vahdat, A.: Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509 (2024)

  79. [80]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xu, S., Huang, Y., Pan, J., Ma, Z., Chai, J.: Inversion-free image editing with language-guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9452–9461 (2024)

  80. [81]

    In: ACM SIGGRAPH 2024 Conference Papers

    Yang, S., Hou, L., Huang, H., Ma, C., Wan, P., Zhang, D., Chen, X., Liao, J.: Direct-a-video: Customized video generation with user-directed camera movement and object motion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–12 (2024)

Showing first 80 references.