GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Fredrik Kahl; Josef Bengtson; Yaroslava Lochman

arxiv: 2606.05142 · v1 · pith:7I4MJDXPnew · submitted 2026-06-03 · 💻 cs.CV · cs.AI


Pith reviewed 2026-06-28 06:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: multi-view image editing, nonrigid edits, geometry-aware editing, depth map alignment, point cloud alignment, training-free method, 3D consistency, generative image editing


The pith

GeM-NR aligns depth-derived point clouds to propagate nonrigid edits consistently across multiple scene views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training-free pipeline that starts from one edited anchor view and produces matching edits on other views even when the edit substantially alters object shapes and textures. It estimates depth for both scenes, aligns their 3D point clouds to recover correspondences, projects the edit into the target camera, and refines the result while conditioning on the original unedited image. Existing techniques are restricted to rigid structure preservation or appearance-only changes, so this approach widens the range of feasible 3D-aware edits without task-specific retraining.

Core claim

GeM-NR is a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: depth map estimation with a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, projection onto a query viewpoint, and refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object.

What carries the argument

Depth map estimation followed by point-cloud alignment that maximizes 3D correspondence between edited and original scenes, followed by projection and conditioned refinement.
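
The projection stage can be illustrated with a toy forward-warping sketch. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a known relative pose `(R, t)` from the anchor to the query camera, and a per-pixel depth map for the edited anchor. The function names and these modelling choices are hypothetical and are not taken from the paper. Pixels that receive no projected point stay empty, which is one plausible reading of a partially filled initialization that a refinement stage would then complete.

```python
# Toy forward warping with a z-buffer (pure Python, no dependencies).
# Hypothetical names: unproject, warp_to_query, K = (fx, fy, cx, cy), R, t.

def unproject(depth, K):
    """Lift an H x W depth map to 3D points in the anchor camera frame.

    Returns a list of (x, y, z, u, v) so each point remembers its source pixel.
    """
    fx, fy, cx, cy = K
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z, u, v))
    return points


def warp_to_query(points, colors, K, R, t, height, width):
    """Project anchor points into the query view, keeping the nearest per pixel.

    Pixels hit by no point remain None (holes left for a later refinement step).
    """
    fx, fy, cx, cy = K
    image = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for x, y, z, u0, v0 in points:
        # Change of frame: X_query = R * X_anchor + t
        xq = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
        yq = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
        zq = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
        if zq <= 0:
            continue  # point lies behind the query camera
        u = round(fx * xq / zq + cx)
        v = round(fy * yq / zq + cy)
        if 0 <= u < width and 0 <= v < height and zq < zbuf[v][u]:
            zbuf[v][u] = zq
            image[v][u] = colors[v0][u0]
    return image


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (1.0, 1.0, 0.0, 0.0)
depth = [[1.0, 1.0], [1.0, 1.0]]
colors = [["a", "b"], ["c", "d"]]
points = unproject(depth, K)
same_view = warp_to_query(points, colors, K, IDENTITY, (0.0, 0.0, 0.0), 2, 2)
shifted = warp_to_query(points, colors, K, IDENTITY, (1.0, 0.0, 0.0), 2, 2)
```

With the identity pose the anchor image is reproduced unchanged. With a sideways translation the left column of `shifted` is empty, which is the kind of hole a conditioned refinement step would need to fill.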

If this is right

The method produces consistent edits for tasks that substantially alter scene geometry and appearance, where prior approaches fail.
Quantitative and qualitative evaluations show state-of-the-art performance on edit quality together with geometric and photometric consistency across views.
The same pipeline supports generation of 3D representations from the edited multi-view set.
The conditioning formulation extends naturally from pairs of views to larger numbers of viewpoints without additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the alignment procedure remains stable under larger viewpoint gaps, the same stages could be applied to video sequences with moving cameras.
Replacing the backbone editor with newer generative models would immediately widen the variety of nonrigid edits the pipeline can accept.
The point-cloud alignment objective might be reused as a consistency regularizer inside other multi-view reconstruction pipelines.

Load-bearing premise

The point-cloud alignment step recovers accurate 3D correspondences even after nonrigid geometry changes have been applied to the scene.

What would settle it

Multi-view test cases in which a nonrigid edit produces large mismatches between the edited and original point clouds, resulting in visible geometric inconsistencies or failed refinement across views.
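
One way to quantify such a mismatch is a symmetric Chamfer distance between the two point clouds. This is an illustrative choice, not a metric the paper is known to use, and the brute-force nearest-neighbour search below is only suitable for small clouds.

```python
import math

# Hypothetical helper: symmetric Chamfer distance between two small point clouds.
# Brute force, O(len(a) * len(b)); points are tuples of equal dimension.

def chamfer_distance(cloud_a, cloud_b):
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
edited = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]  # second point displaced by the edit
mismatch = chamfer_distance(original, edited)
```

A value of zero means every point has an exact counterpart; larger values indicate the kind of geometric disagreement that would stress the alignment step.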

Figures

Figures reproduced from arXiv: 2606.05142 by Fredrik Kahl, Josef Bengtson, Yaroslava Lochman.

**Figure 1.** Depth Anything 3 is trained to tackle dynamic scene reconstruction, and we leverage this ability for challenging nonrigid edits. The model views {Asrc, Aedited, Qsrc} as images of a single dynamic scene, where possible changes in scene geometry and photometry over time appear at Aedited. Edit initialization with warping: a partially-filled edited image is rendered at a query viewpoint from projecting the po…

**Figure 9.** Qualitative image pair editing examples for Edicho and our GeM-NR. Prompts: "A tiger plushie" and "A berry bowl with a floral pattern". Columns: Unedited, Edicho, Ours.

Abstract

Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GeM-NR tries to fix multi-view consistency for nonrigid edits via point-cloud alignment on depth maps, but the alignment step is too vaguely described and the SOTA claims lack visible support.


The paper's main move is to take an edited anchor view from an off-the-shelf generator, estimate depths on both the edited anchor and an unedited query view, align the resulting point clouds to transfer the edit, project it, and then refine the projected image with conditioning on the original query. This is positioned as a training-free way to handle edits that change geometry substantially, unlike prior work limited to rigid or appearance-only changes.

The approach has a clear practical goal: plug into existing editors like FLUX or BrushNet and get multi-view consistency without task-specific retraining. The conditioning refinement that scales from two to many views is a sensible engineering choice if the projection step supplies a decent starting point.

The soft spot is exactly where the stress-test note flags it. The alignment is described only as "a strategy to maximize the alignment between the 3D point clouds," with no objective function, deformation model, optimizer, or handling for non-isometric or topology-changing edits. Since the edit is nonrigid by design, a rigid or near-rigid alignment will leave residuals that feed straight into the projection. The abstract claims state-of-the-art geometric and photometric consistency plus extensive quantitative and qualitative results, yet supplies no tables, datasets, metrics, ablations, or failure cases. Without those, it is impossible to tell whether the consistency comes from the alignment or from easy test cases and the backbone model.
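
The residual argument can be made concrete with a toy 2D example. The sketch below fits the least-squares rotation and translation between two point sets with known correspondences (a 2D Procrustes fit, used here only for illustration; the paper's actual alignment strategy is not described in the abstract). A rigidly moved square aligns with essentially zero error, while a square stretched along one axis leaves a residual that no rigid motion can remove.

```python
import math

# Hypothetical helpers: rigid_align_2d and rms_residual are illustrative names.

def rigid_align_2d(src, dst):
    """Least-squares rotation + translation mapping src onto dst (2D, paired points)."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(p[0] for p in dst) / n
    cdy = sum(p[1] for p in dst) / n
    s_dot = s_cross = 0.0
    for (x, y), (a, b) in zip(src, dst):
        x, y, a, b = x - csx, y - csy, a - cdx, b - cdy
        s_dot += x * a + y * b
        s_cross += x * b - y * a
    theta = math.atan2(s_cross, s_dot)  # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    return [
        (c * (x - csx) - s * (y - csy) + cdx, s * (x - csx) + c * (y - csy) + cdy)
        for x, y in src
    ]


def rms_residual(a, b):
    """Root-mean-square distance between paired points."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Rigid change: rotate by 30 degrees and translate.
c30, s30 = math.cos(math.pi / 6), math.sin(math.pi / 6)
moved = [(c30 * x - s30 * y + 3.0, s30 * x + c30 * y - 2.0) for x, y in square]
rigid_error = rms_residual(rigid_align_2d(square, moved), moved)

# Non-isometric change: stretch the square to twice its width.
stretched = [(2.0 * x, y) for x, y in square]
stretch_error = rms_residual(rigid_align_2d(square, stretched), stretched)
```

Here `rigid_error` is zero up to floating-point error, whereas `stretch_error` stays well above zero. Any such leftover displacement would be carried into the projected image unless the alignment model itself can deform.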

This is relevant to people building 3D content pipelines who need a general consistency layer rather than per-task models. A reader who can see the full evaluation section might extract useful implementation details. The work deserves peer review so referees can check whether the alignment actually delivers on the claimed class of edits and whether the numbers back the claims.

Referee Report

2 major / 2 minor

Summary. The paper proposes GeM-NR, a training-free pipeline for multi-view consistent image editing under nonrigid geometric and appearance changes. Given an anchor image edited by an off-the-shelf backbone (FLUX, Qwen, BrushNet) and an unedited query view, the method (i) estimates depth maps and aligns the resulting 3D point clouds, (ii) projects the edited content into the query viewpoint, and (iii) refines the projected image by conditioning a generator on the original query. The authors claim this yields state-of-the-art edit quality together with geometric and photometric consistency across views, including the ability to produce 3D representations of the edited scene.

Significance. A reliable, training-free method that genuinely supports large nonrigid edits while preserving multi-view consistency would be a notable contribution to 3D-aware image editing. The modular design (backbone editor + alignment + conditioning) is attractive for practical use. However, the central technical step—recovering usable correspondences after non-isometric geometry change—remains too vaguely described to assess whether the claimed consistency gains are attributable to the proposed alignment rather than to the backbone or to the choice of test cases.

major comments (2)

[depth-map estimation stage (§3)] Depth-map estimation stage (abstract and §3): the alignment between edited and unedited point clouds is described only as “a strategy to maximize the alignment.” No objective function, deformation model, rigidity or topology-change handling, optimizer, or convergence criterion is supplied. Because the edit is non-isometric by construction, any rigid or near-rigid alignment will leave large residuals that propagate directly into the projection step; without an explicit formulation it is impossible to verify that the claimed consistency for “significant changes in geometry” is achieved by the method rather than by easy cases or by the later refinement.
[evaluation section] Evaluation section: the abstract asserts quantitative SOTA results on geometric and photometric consistency, yet the provided description supplies neither the metrics, datasets, number of views, error bars, nor an ablation isolating the contribution of the point-cloud alignment. Without these data the central claim that GeM-NR outperforms prior multi-view editors on nonrigid edits cannot be evaluated.

minor comments (2)

[abstract] The scaling statement “from two to many views” would benefit from an explicit statement of the maximum number of views tested and any degradation observed.
[method] Notation for the three stages (i)–(iii) is introduced in the abstract but not carried forward with consistent symbols in the method description; adding equation numbers or algorithm pseudocode would improve clarity.
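
As an indication of what such pseudocode could look like, the sketch below strings stages (i) to (iii) together for several query views. It assumes each query is processed independently against the same anchor edit, which is an assumption of this sketch rather than something the abstract states. The three stage callables are hypothetical placeholders, not the authors' functions.

```python
# Orchestration sketch only; every callable is a hypothetical placeholder.

def edit_views(anchor_src, anchor_edited, queries, estimate_geometry, warp, refine):
    """Apply stages (i)-(iii) to each unedited query view, reusing one anchor edit."""
    edited = []
    for query_src in queries:
        # (i) depth / point clouds for the edited and unedited scene
        geometry = estimate_geometry(anchor_src, anchor_edited, query_src)
        # (ii) project the edited anchor into the query viewpoint
        initial = warp(anchor_edited, geometry)
        # (iii) refine, conditioned on the unedited query image
        edited.append(refine(initial, query_src))
    return edited


# Stub stages that only record what they were given.
demo = edit_views(
    "A_src", "A_edited", ["Q1", "Q2"],
    estimate_geometry=lambda a, e, q: (a, e, q),
    warp=lambda edited, geometry: f"warp({edited}->{geometry[2]})",
    refine=lambda initial, query: f"refine({initial}|{query})",
)
```

Passing the stages in as arguments keeps the loop independent of any particular depth estimator or backbone editor, in line with the modular design described in the summary.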

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the technical description and evaluation require greater clarity. We address each major comment below and will revise the manuscript to incorporate the requested details.


Referee: [depth-map estimation stage (§3)] Depth-map estimation stage (abstract and §3): the alignment between edited and unedited point clouds is described only as “a strategy to maximize the alignment.” No objective function, deformation model, rigidity or topology-change handling, optimizer, or convergence criterion is supplied. Because the edit is non-isometric by construction, any rigid or near-rigid alignment will leave large residuals that propagate directly into the projection step; without an explicit formulation it is impossible to verify that the claimed consistency for “significant changes in geometry” is achieved by the method rather than by easy cases or by the later refinement.

Authors: We agree that the alignment procedure in §3 is described at too high a level. The current manuscript refers only to “a strategy to maximize the alignment” without supplying the objective function, deformation model, rigidity assumptions, topology handling, optimizer, or convergence criteria. In the revised manuscript we will expand §3 with the explicit formulation of the point-cloud alignment, including the objective, the deformation model and its capacity to accommodate non-isometric changes, the optimizer, and convergence criterion. This will make it possible to evaluate whether the reported consistency gains for large geometric edits are attributable to the alignment step. revision: yes
Referee: [evaluation section] Evaluation section: the abstract asserts quantitative SOTA results on geometric and photometric consistency, yet the provided description supplies neither the metrics, datasets, number of views, error bars, nor an ablation isolating the contribution of the point-cloud alignment. Without these data the central claim that GeM-NR outperforms prior multi-view editors on nonrigid edits cannot be evaluated.

Authors: We acknowledge that the evaluation section must be expanded to support the quantitative claims. The manuscript states that an extensive evaluation was performed, but does not currently enumerate the concrete metrics, datasets, view counts, error bars, or ablations isolating the alignment module. In the revised version we will add these elements: explicit definitions of the geometric and photometric consistency metrics, the datasets and number of views used, error bars on all reported numbers, and an ablation that isolates the contribution of the point-cloud alignment step. This will allow direct assessment of the state-of-the-art claims. revision: yes

Circularity Check

0 steps flagged

No circularity; pipeline composes external depth estimators and generators without self-referential reductions


The paper presents GeM-NR as a training-free composition of existing depth estimators, point-cloud alignment, projection, and conditioned refinement using backbone editors such as FLUX. No equations, fitted parameters, or predictions are defined in terms of themselves. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The alignment strategy is described at a high level without reducing to a tautology or fitted input. The derivation chain therefore remains self-contained against external components and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly relies on the accuracy of off-the-shelf depth estimators and the existence of a useful rigid alignment between edited and original geometry.


Reference graph

Works this paper leans on

75 extracted references · 20 canonical work pages · 6 internal anchors

[1]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Asim, M., Wewer, C., Wimmer, T., Schiele, B., Lenssen, J.E.: Met3r: Measur- ing multi-view consistency in generated images. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6034–6044 (2025)

2025
[2]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Bai, Q., Ouyang, H., Xu, Y., Wang, Q., Yang, C., Cheng, K.L., Shen, Y., Chen, Q.: Edicho: Consistent image editing in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 15277–15287 (October 2025)

2025
[3]

Proceedings of the Computer Vision and Pattern Recognition Conference (2022)

Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-nerf 360: Unbounded anti-aliased neural radiance fields. Proceedings of the Computer Vision and Pattern Recognition Conference (2022)

2022
[4]

Bengtson, J., Nilsson, D., Lee, D.I., Lochman, Y., Kahl, F.: 3d-consistent multi- view editing by correspondence guidance (2026),https://arxiv.org/abs/2511. 22228

2026
[5]

In: Wallraven, C., Liu, C.L., Ross, A

Bengtson, J., Nilsson, D., Lin, C.T., Büsching, M., Kahl, F.: Adjustable visual ap- pearance for generalizable novel view synthesis. In: Wallraven, C., Liu, C.L., Ross, A. (eds.) Pattern Recognition and Artificial Intelligence. pp. 157–171. Springer Nature Singapore, Singapore (2025)

2025
[6]

Black Forest Labs: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/ flux-2(2025)

2025
[7]

ai / blog / flux2 - klein - towards - interactive - visual - intelligence (2025)

Black Forest Labs: FLUX.2 [klein]: Towards Interactive Visual Intelligence.https: / / bfl . ai / blog / flux2 - klein - towards - interactive - visual - intelligence (2025)

2025
[8]

In: Proceedings of the Computer Vision and Pattern Recog- nition Conference (2023)

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the Computer Vision and Pattern Recog- nition Conference (2023)

2023
[9]

2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp

Caron, M., Touvron, H., Misra, I., J’egou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 9630–9640 (2021), https://api.semanticscholar.org/CorpusID:233444273

2021
[10]

In: Proceedings of the 18 J

Chen, D.Y., Tennent, H., Hsu, C.W.: Artadapter: Text-to-image style transfer using multi-level style encoder and explicit adaptation. In: Proceedings of the 18 J. Bengtson, Y. Lochman, F. Kahl IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8619–8628 (June 2024)

2024
[11]

Chen,L.,Li,R.,Zhang,G.,Wang,P.,Zhang,L.:Fastmulti-viewconsistent3dedit- ing with video priors. Proceedings of the AAAI Conference on Artificial Intelligence 40(4), 2948–2956 (Mar 2026).https://doi.org/10.1609/aaai.v40i4.37286, https://ojs.aaai.org/index.php/AAAI/article/view/37286

work page doi:10.1609/aaai.v40i4.37286 2026
[12]

In: European Conference on Computer Vision

Chen, M., Laina, I., Vedaldi, A.: Dge: Direct gaussian 3d editing by consistent multi-view editing. In: European Conference on Computer Vision. pp. 74–92. Springer (2024)

2024
[13]

Chen, Y., Chen, Z., Zhang, C., Wang, F., Yang, X., Wang, Y., Cai, Z., Yang, L., Liu, H., Lin, G.: Gaussianeditor: Swift and controllable 3d editing with gaussian splatting (2023)

2023
[14]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Chung, J., Hyun, S., Heo, J.P.: Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8795–8805 (June 2024)

2024
[15]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., et. al., E.R.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities (2025),https://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

In: Thirty-seventh Conference on Neural Information Processing Sys- tems (2023)

Dong, J., Wang, Y.X.: Vica-nerf: View-consistency-aware 3d editing of neural radi- ance fields. In: Thirty-seventh Conference on Neural Information Processing Sys- tems (2023)

2023
[17]

In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition

Edstedt, J., Sun, Q., Bökman, G., Wadenbäck, M., Felsberg, M.: Roma: Robust dense feature matching. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. pp. 19790–19800 (2024)

2024
[18]

Erkoç, Z., Dai, A., Nießner, M.: Worldagents: Can foundation image models be agents for 3d world models? arXiv preprint arXiv:2603.19708 (2026)

work page arXiv 2026
[19]

In: Pro- ceedings of the 41st International Conference on Machine Learning

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Pro- ceedings of the 41st International Conference on Machine Learning. Proceedings of Machine Learning Resea...

2024
[20]

In: ICLR (2024),https: //arxiv.org/abs/2309.17102

Fu, T.J., Hu, W., Du, X., Wang, W., Yang, Y., Gan, Z.: Guiding instruction-based image editing via multimodal large language models. In: ICLR (2024),https: //arxiv.org/abs/2309.17102

work page arXiv 2024
[21]

Gomel, E., Wolf, L.: Diffusion-based attention warping for consistent 3d scene editing (2024),https://arxiv.org/abs/2412.07984

work page arXiv 2024
[22]

In: Proceedings of the IEEE/CVF interna- tional conference on computer vision

Haque,A.,Tancik,M.,Efros,A.A.,Holynski,A.,Kanazawa,A.:Instruct-nerf2nerf: Editing 3d scenes with instructions. In: Proceedings of the IEEE/CVF interna- tional conference on computer vision. pp. 19740–19750 (2023)

2023
[23]

In: International Conference on Learning Representations (2023)

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross-attention control. In: International Conference on Learning Representations (2023)

2023
[24]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Hertz, A., Voynov, A., Fruchter, S., Cohen-Or, D.: Style aligned image generation via shared attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4775–4785 (June 2024)

2024
[25]

In: Pro- ceedings of the 34th International Conference on Neural Information Processing Systems

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Pro- ceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20, Curran Associates Inc., Red Hook, NY, USA (2020) GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes 19

2020
[26]

ACM Transactions on Graphics (TOG)44(6), 1–16 (2025)

Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al.: Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Transactions on Graphics (TOG)44(6), 1–16 (2025)

2025
[27]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 150–168. Springer Nature Switzerland, Cham (2025)

2024
[28]

In: The Thirty-ninth Annual Con- ference on Neural Information Processing Systems (2026),https://openreview

Koh, E., Hyun, S., Lee, M., Chung, J., Seo, K., Heo, J.P.: Diffusion feature field for text-based 3d editing with gaussian splatting. In: The Thirty-ninth Annual Con- ference on Neural Information Processing Systems (2026),https://openreview. net/forum?id=Kf9eNbp4wy

2026
[29]

Labs,B.F.,Batifol,S.,Blattmann,A.,Boesel,F.,Consul,S.,Diagne,C.,Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://arxiv.org/abs/2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)

Lee, D.I., Doh, H., Chi, S., Duan, R., Kim, S., Ramani, K.: Dynamic-editor: Training-free text-driven 4d scene editing with multimodal diffusion transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)

2026
[31]

In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

Lee, D.I., Park, H., Seo, J., Park, E., Park, H., Baek, H.D., Shin, S., Kim, S., Kim, S.: Editsplat: Multi-view fusion and attention-guided optimization for view- consistent 3d scene editing with 3d gaussian splatting. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 11135–11145 (2025)

2025
[32]

arXiv preprint arXiv:2508.19247 (2025)

Li, L., Huang, Z., Feng, H., Zhuang, G., Chen, R., Guo, C., Sheng, L.: Voxhammer: Training-free precise and coherent 3d editing in native 3d space. arXiv preprint arXiv:2508.19247 (2025)

work page arXiv 2025
[33]

In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=yirunib8l8

Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Zhao, Y., Peng, S., Guo, H., Zhou, X., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=yirunib8l8

2026
[34]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=PqvMRDCJT9t

Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=PqvMRDCJT9t

2023
[35]

In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2021)

Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N., Kanazawa, A.: Infinite nature: Perpetual view generation of natural scenes from a single image. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2021)

2021
[36]

In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=XVjTT1nw5z

Liu, X., Gong, C., qiang liu: Flow straight and fast: Learning to generate and trans- fer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=XVjTT1nw5z

2023
[37]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)

Liyi, C., Pengfei, W., Guowen, Z., Zhiyuan, M., Lei, Z.: Omni-3dedit: Generalized versatile 3d editing in one-pass. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026)

2026
[38]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: Re- paint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11461–11471 (June 2022)

2022
[39]

In: International Conference on Learning Representations (2022) 20 J

Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2022) 20 J. Bengtson, Y. Lochman, F. Kahl

2022
[40]

In: ECCV (2024)

Mirzaei, A., Aumentado-Armstrong, T., Brubaker, M.A., Kelly, J., Levinshtein, A., Derpanis, K.G., Gilitschenski, I.: Watch your steps: Local image and scene editing by text instructions. In: ECCV (2024)

2024
[41]

In: CVPR (2023)

Mirzaei, A., Aumentado-Armstrong, T., Derpanis, K.G., Kelly, J., Brubaker, M.A., Gilitschenski, I., Levinshtein, A.: SPIn-NeRF: Multiview segmentation and percep- tual inpainting with neural radiance fields. In: CVPR (2023)

2023
[42]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Müller, N., Schwarz, K., Rössle, B., Porzi, L., Bulò, S.R., Nießner, M., Kontschieder, P.: Multidiff: Consistent novel view synthesis from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10258–10268 (June 2024)

2024
[43]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Park, J., Choi, T.E., Jun, Y., Hwang, S.J.: Wave: Warp-based view guidance for consistent novel view synthesis using a single image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11906– 11915 (October 2025)

2025
[44]

In: Meila, M., Zhang, T

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceed- ings of Machine Learning Res...

2021
[45]

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text- conditional image generation with clip latents (2022),https://arxiv.org/abs/ 2204.06125

work page internal anchor Pith review Pith/arXiv arXiv 2022
[46]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12179–12188 (October 2021)

2021
[47]

IEEE Transactions on Pattern Analysis and Machine Intelligence44(3) (2022)

Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence44(3) (2022)

2022
[48]

In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Pro- ceedings, Part XI

Rojas, S., Philip, J., Zhang, K., Bi, S., Luan, F., Ghanem, B., Sunkavalli, K.: Datenerf: Depth-aware text-based editing of nerfs. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Pro- ceedings, Part XI. p. 267–284. Springer-Verlag, Berlin, Heidelberg (2024).https: //doi.org/10.1007/978-3-031-73247-8_1...

work page doi:10.1007/978-3-031-73247-8_16 2024
[49]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684– 10695 (June 2022)

2022
[50]

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22500–22510 (June 2023)

[51]

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., Ho, J., Fleet, D., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Informati...

[52]

Seo, J., Fukuda, K., Shibuya, T., Narihira, T., Murata, N., Hu, S., Lai, C.H., Kim, S., Mitsufuji, Y.: Genwarp: Single image to novel views with semantic-preserving generative warping. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 80220–8024...

[53]

Song, L., Cao, L., Gu, J., Jiang, Y., Yuan, J., Tang, H.: Efficient-nerf2nerf: Streamlining text-driven 3d editing with multiview correspondence-enhanced diffusion models. arXiv preprint arXiv:2312.08563 (2023)

[54]

Tung, J., Chou, G., Cai, R., Yang, G., Zhang, K., Wetzstein, G., Hariharan, B., Snavely, N.: Megascenes: Scene-level view synthesis at scale. In: ECCV (2024)

[55]

Wang, B., Dutt, N.S., Mitra, N.J.: Proteusnerf: Fast lightweight nerf editing using 3d-aware image context. Proc. ACM Comput. Graph. Interact. Tech. 7(1) (May 2024). https://doi.org/10.1145/3651290

[56]

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)

[57]

Wang, J., Lin, C., Sun, L., Cao, Z., Yin, Y., Nie, L., Yuan, Z., Chu, X., Wei, Y., Liao, K., et al.: Geometry-guided reinforcement learning for multi-view consistent 3d scene editing. arXiv preprint arXiv:2603.03143 (2026)

[58]

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20697–20709 (2024)

[59]

Wang, Y., Yi, X., Wu, Z., Zhao, N., Chen, L., Zhang, H.: View-consistent 3d editing with gaussian splatting. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29 – October 4, 2024, Proceedings, Part XXXV. pp. 404–420. Springer-Verlag, Berlin, Heidelberg (2024). https://doi.org/10.1007/978-3-031-72761-0_23, https://doi.org/...

[60]

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

[61]

Wu, J., Bian, J.W., Li, X., Wang, G., Reid, I., Torr, P., Prisacariu, V.: GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing. ECCV (2024)

[62]

Xia, R., Tang, Y., Zhou, P.: Towards scalable and consistent 3d editing (2025), https://arxiv.org/abs/2510.02994

[63]

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13294–13304 (June 2025)

[64]

Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al.: Qwen2 technical report (2024), https://arxiv.org/abs/2407.10671

[65]

Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: CVPR (2024)

[66]

Yao, Y., Luo, Z., Li, S., Zhang, J., Ren, Y., Zhou, L., Fang, T., Quan, L.: Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Proceedings of the Computer Vision and Pattern Recognition Conference (2020)

[67]

Ye, J., Xie, S., Zhao, R., Wang, Z., Yan, H., Zu, W., Ma, L., Zhu, J.: Nano3d: A training-free approach for efficient 3d editing without masks (2025), https://arxiv.org/abs/2510.15019

[68]

You, M., Zhu, Z., Liu, H., Hou, J.: Nvs-solver: Video diffusion model as zero-shot novel view synthesizer. In: International Conference on Learning Representations (2025)

[69]

Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis. IEEE Transactions on Pattern Analysis & Machine Intelligence (01), 1–18 (2025). https://doi.org/10.1109/TPAMI.2025.3613256, https://doi.ieeecomputersociety.org/10.1109/TPAMI....

[70]

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 31428–31449. Curran Associates, Inc. (2023), https://proceedings.neurips.cc/paper_file...

[71]

Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models (2023)

[72]

Zhang, Y., Huang, N., Tang, F., Huang, H., Ma, C., Dong, W., Xu, C.: Inversion-based style transfer with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10146–10156 (June 2023)

[73]

Zhao, S., Chen, D., Chen, Y.C., Bao, J., Hao, S., Yuan, L., Wong, K.Y.K.: Uni-controlnet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems (2023)

[74]

Zhu, Z., Chen, H., Li, P., Wei, M.: Coreeditor: Correspondence-constrained diffusion for consistent 3d editing. IEEE Transactions on Visualization and Computer Graphics 32(3), 2838–2851 (2026). https://doi.org/10.1109/TVCG.2026.3657658

[75]

Zhuang, J., Zeng, Y., Liu, W., Yuan, C., Chen, K.: A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 195–211. Springer Nature Switzerland, Cham (2025)
