Bridging the Manifold Gap: Riemannian Residual Line Search for One-Step Image Editing

Hongzhu Yi; Jungang Xu; Tong Li; Yiyan Fan; Zhongtian Luo

arxiv: 2606.24844 · v1 · pith:HAG7P7TPnew · submitted 2026-06-23 · 💻 cs.CV


Pith reviewed 2026-06-26 00:23 UTC · model grok-4.3


keywords: one-step diffusion editing, Riemannian residual line search, image editing, CLIP alignment, energy-field transport, manifold gap, prompt-delta field


The pith

Riemannian Residual Line Search improves one-step diffusion image editing by curvature-correcting the update and selecting the best residual candidate via CLIP alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that one-step diffusion editors face a trade-off between making the target prompt change and keeping the source image intact, and that no single update strength works for all edit types. Instead of redesigning the editor, it treats the problem as selecting among a few candidate outputs generated from an energy-field transport. The proposed Riemannian Residual Line Search estimates the local curvature of the prompt-delta field to create a stronger edit, projects it back to the original norm, builds a residual path, and then chooses the image that best matches the target prompt according to CLIP. This approach is evaluated on a 700-sample benchmark across ten edit types and outperforms other one-step methods.

Core claim

The method builds a stronger edit by estimating the local time curvature of the prompt-delta field and projecting the corrected direction back onto the update norm of the original first-order energy-field transport estimation. It then forms a small residual path from the source image to this strong edit, keeps the original first-order output as a candidate, and selects the final image by maximizing target-prompt CLIP alignment.
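The steps above can be sketched on a toy 2D problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the central finite difference used as the "local time curvature" estimate, the half-step Taylor correction, the `lambdas` grid for the residual path, and the distance-based `score` standing in for CLIP alignment are all hypothetical choices, as are the function names.

```python
import numpy as np

def prompt_delta(x, t, v_tar, v_src):
    """Prompt-delta field: target-conditioned minus source-conditioned velocity."""
    return v_tar(x, t) - v_src(x, t)

def rrls_select(x_src, t, v_tar, v_src, score, h=0.05, step=1.0,
                lambdas=(0.5, 0.75, 1.0)):
    # First-order direction (stand-in for the energy-field transport estimate).
    d1 = prompt_delta(x_src, t, v_tar, v_src)
    # ASSUMPTION: time curvature approximated by a central finite difference
    # of the prompt-delta field in t.
    d_dt = (prompt_delta(x_src, t + h, v_tar, v_src)
            - prompt_delta(x_src, t - h, v_tar, v_src)) / (2.0 * h)
    # ASSUMPTION: second-order Taylor-style correction of the direction.
    d_corr = d1 + 0.5 * step * d_dt
    # Project the corrected direction back onto the first-order update norm.
    d2 = d_corr * (np.linalg.norm(d1) / (np.linalg.norm(d_corr) + 1e-12))
    x_first = x_src + step * d1    # original first-order output
    x_strong = x_src + step * d2   # curvature-corrected "strong" edit
    # Residual path from the source to the strong edit, plus the first-order
    # output retained as candidate 0.
    candidates = [x_first] + [x_src + lam * (x_strong - x_src) for lam in lambdas]
    scores = [score(c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], best, scores

# Toy setup (all hypothetical): the target velocity rotates with t.
target = np.array([1.0, 1.0])
v_src = lambda x, t: np.zeros_like(x)
v_tar = lambda x, t: np.array([1.0, t])
score = lambda x: -float(np.linalg.norm(x - target))  # stand-in for CLIP alignment
x_src = np.zeros(2)
x_best, idx, scores = rrls_select(x_src, 0.5, v_tar, v_src, score)
```

Because the first-order output stays in the candidate set, the selected score can never fall below the first-order score under the chosen selector; whether that selector tracks visual quality is a separate question.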

What carries the argument

Riemannian Residual Line Search, which corrects the first-order transport direction using local time curvature of the prompt-delta field and performs post-hoc selection along the residual path using CLIP scores.

Load-bearing premise

That selecting the candidate with highest target-prompt CLIP alignment reliably produces the best visual result without introducing unmeasured artifacts or source-image degradation that CLIP does not penalize.
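To make the premise concrete, the sketch below contrasts alignment-only selection with a selector that also penalizes distance from the source. Everything here is illustrative: the scores and distances are made-up toy numbers, not results from the paper, and the penalized rule with its `weight` parameter is a hypothetical alternative rather than anything the paper proposes.

```python
import numpy as np

def select_align_only(align):
    """Pick the candidate with the highest target-prompt alignment score."""
    return int(np.argmax(align))

def select_with_preservation(align, src_dist, weight):
    """HYPOTHETICAL alternative: trade alignment against distance to the source."""
    utility = np.asarray(align) - weight * np.asarray(src_dist)
    return int(np.argmax(utility))

# Toy numbers for three candidates (hypothetical, for illustration only).
align = np.array([0.28, 0.30, 0.31])     # alignment with the target prompt
src_dist = np.array([0.05, 0.08, 0.40])  # perceptual distance to the source

pick_align = select_align_only(align)
pick_joint = select_with_preservation(align, src_dist, weight=0.2)
```

In this toy case the alignment-only rule picks the candidate that drifts furthest from the source for a marginal alignment gain, while the penalized rule picks a different one. The premise holds only if such disagreements are rare or visually harmless on real edits.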

What would settle it

Human evaluations or side-by-side visual checks on the 700-sample PIE-Bench++ set showing that a lower-CLIP candidate is preferred due to fewer artifacts or better source preservation.

Figures

Figures reproduced from arXiv: 2606.24844 by Hongzhu Yi, Jungang Xu, Tong Li, Yiyan Fan, Zhongtian Luo.

**Figure 1.** Quantitative radar chart comparisons on PIE-Bench++ across varying generation step regimes. We evaluate all methods across seven key dimensions, including text-alignment (CLIP-Whole, CLIP-Edited), structural and perceptual fidelity (PSNR, MSE, SSIM, LPIPS, DINO), and efficiency (Runtime). For clear calibration, our method RRLS (represented by the outermost bold blue profile) serves as the baseline referenc…

**Figure 2.** Qualitative comparison of different one-step image editing paradigms. From top to bottom, the rows display various semantic shifting tasks including category substitution, landscape editing, and attribute modification. Compared to the naive baseline and first-order trajectory correction, our RRLS framework consistently achieves superior text-to-image semantic alignment while strictly preserving unedited r…

**Figure 3.** Overview and comparison of different one-step image editing paradigms. (a) Naive One-step Update directly applies the instantaneous velocity field v(x_t, t, c_tar) from the source path, which heavily relies on a simple linear drift and frequently suffers from either under-editing or severe over-drift. (b) One-step, 1st-order Method approximates the transport velocity via a first-order chord vector u^(1)(x_t) …

**Figure 4.** Visualization of the learned energy fields and the corresponding editing results by RRLS. For each group, we present the source image (left), the computed latent energy field map (middle), and the final one-step editing outcome (right). The colorbar scales from low (dark purple) to high (light yellow) energy intensity. Crucially, the energy fields precisely localize the semantic regions requiring transform…

**Figure 5.** 2D Example of One-Step Image Editing. One-step update transport can quickly achieve injective alignment of manifolds, whereas the naive one-step update yields a lower degree of manifold alignment. Energy fields using a first-order approximation struggle to adapt when the source and target manifolds differ substantially in shape, while energy fields using a second-order approximation adapt to such cases m…

Original abstract

One-step diffusion editors are fast because they avoid inversion and iterative optimization, but a single transport update must be aggressive enough to realize the target prompt and conservative enough to preserve the source image, and no fixed update strength satisfies both demands across edit types. We treat this tension as a post-hoc candidate-selection problem on top of energy-field transport rather than as a new editing model. Our proposed method, Riemannian Residual Line Search, first builds a stronger edit by estimating the local time curvature of the prompt-delta field and projecting the corrected direction back onto the update norm of the original first-order energy-field transport estimation. It then forms a small residual path from the source image to this strong edit, retains the original first-order output as one candidate, and picks the final image by maximizing target-prompt CLIP alignment. On a 700-sample PIE-Bench++ evaluation across 10 edit type IDs, our method achieves state-of-the-art (SOTA) performance among current one-step update algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds curvature estimation and Riemannian projection to build a residual candidate path on top of energy-field transport, then picks the max-CLIP output; the SOTA numbers rest on that selection step.


The main contribution is a post-hoc procedure that estimates local curvature of the prompt-delta field, projects a corrected direction back onto the original update norm, constructs a short residual path, and selects between the first-order result and the curvature-adjusted one by highest target-prompt CLIP score. This directly addresses the known tension in one-step diffusion editing without changing the underlying transport model.

The approach is straightforward and the 700-sample PIE-Bench++ evaluation across ten edit types is a concrete data point. Treating the problem as candidate selection rather than a new editing architecture keeps the method lightweight and easy to layer on existing one-step pipelines.

The soft spot is that the reported gains appear to come almost entirely from the final CLIP-max choice. The abstract gives no sign that the authors checked whether this selection preserves source fidelity or avoids artifacts that CLIP does not penalize; standard concerns about CLIP as a perceptual proxy apply here. Without ablations on the curvature step itself or comparisons against source-prompt CLIP and LPIPS on the chosen outputs, it is hard to tell how much of the improvement is geometric versus just better prompt alignment.
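One way to run the kind of check described here is a per-sample audit comparing how well the selected output and the first-order output each preserve the source. The sketch below is a hedged illustration: `psnr` and `preservation_audit` are hypothetical helper names, whole-image PSNR stands in for whatever fidelity metric is preferred (LPIPS or SSIM would slot in the same way), benchmark protocols often restrict such metrics to the unedited region via a mask, which is omitted here, and the arrays are synthetic rather than real edits.

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = float(np.mean((a - b) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def preservation_audit(sources, first_order, selected):
    """Compare source preservation of selected outputs against first-order outputs.

    Returns the mean PSNR change (selected minus first-order) and the fraction
    of samples where the selected output preserves the source worse.
    """
    deltas = np.array([psnr(s, sel) - psnr(s, fo)
                       for s, fo, sel in zip(sources, first_order, selected)])
    return {"mean_psnr_delta": float(deltas.mean()),
            "frac_worse": float((deltas < 0).mean())}

# Synthetic stand-ins (hypothetical): selection changes two of four samples.
rng = np.random.default_rng(0)
sources = [rng.random((8, 8, 3)) for _ in range(4)]
first_order = [s + 0.01 for s in sources]
selected = [s + 0.02 for s in sources[:2]] + first_order[2:]
report = preservation_audit(sources, first_order, selected)
```

A `frac_worse` near zero on the real benchmark would suggest the selection step is not paying for alignment with source fidelity; a large value would support the concern raised above.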

The work is aimed at people already working on fast, inversion-free diffusion editors. It is narrow but the evaluation is large enough that a referee could usefully check the Riemannian construction and the selection protocol. I would send it for review rather than desk-reject.

Referee Report

1 major / 0 minor

Summary. The paper claims that one-step diffusion editors face an inherent tension between edit aggressiveness and source preservation that no fixed update strength resolves across edit types. It treats this as a post-hoc candidate-selection problem atop energy-field transport: Riemannian Residual Line Search estimates the local time curvature of the prompt-delta field, projects the curvature-corrected direction back onto the norm of the original first-order transport step, forms a short residual path from the source to this stronger edit, retains the first-order output as a candidate, and selects the final image by maximizing target-prompt CLIP alignment. On a 700-sample PIE-Bench++ evaluation spanning 10 edit type IDs, the method is reported to achieve SOTA among current one-step update algorithms.

Significance. If the result holds and the selection step is shown to preserve source fidelity, the approach supplies a lightweight, training-free improvement to existing one-step editors by exploiting the geometry of the prompt-delta field. The 700-sample, 10-type evaluation scale is substantial for the sub-area and would make the method immediately usable if the reported gains survive additional source-preservation checks.

major comments (1)

[Abstract] The SOTA claim rests entirely on the final CLIP-max selection step after the Riemannian residual path is constructed. No results are supplied for source-prompt CLIP similarity, LPIPS, or any other source-fidelity metric to confirm that the selected image does not degrade the source relative to the first-order baseline; this assumption is load-bearing for the central performance claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate the requested source-fidelity analysis in the revision.


Referee: [Abstract] The SOTA claim rests entirely on the final CLIP-max selection step after the Riemannian residual path is constructed. No results are supplied for source-prompt CLIP similarity, LPIPS, or any other source-fidelity metric to confirm that the selected image does not degrade the source relative to the first-order baseline; this assumption is load-bearing for the central performance claim.

Authors: We agree that source-fidelity metrics are necessary to validate the central claim. The manuscript reports target-prompt CLIP scores and qualitative preservation but does not supply quantitative source-prompt CLIP similarity, LPIPS, or equivalent comparisons between the CLIP-selected output and the first-order baseline. In the revision we will add these metrics (both in the abstract and in the main results) to demonstrate that the selection step does not degrade source fidelity relative to the baseline. (Revision: yes.)

Circularity Check

0 steps flagged

No circularity in derivation; empirical method with explicit selection rule


The provided abstract and description contain no equations, first-principles derivations, or predictions that reduce to inputs by construction. The method is presented as an algorithmic procedure (curvature estimation on the prompt-delta field, residual path construction, and final selection by target-prompt CLIP alignment) whose output is then evaluated on an external benchmark (PIE-Bench++). No self-citation load-bearing steps, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are visible. The CLIP selection is an explicit, stated component of the algorithm rather than a hidden redefinition of the result; any concern about its correlation with perceptual quality is a correctness or metric-validity issue, not a circularity reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated.




[1] [1]

Stable flow: Vital layers for training-free image editing

Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchi- nov, Kfir Aberman, Dani Lischinski, and Daniel Cohen-Or. Stable flow: Vital layers for training-free image editing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7877–7888, 2025. 3

2025

[2] [2]

A computational fluid mechanics solution to the monge-kantorovich mass transfer problem.Numerische Mathematik, 84(3):375–393,

Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the monge-kantorovich mass transfer problem.Numerische Mathematik, 84(3):375–393,

[3] [3]

In- structpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18392–18402, 2023. 2, 3

2023

[4] [4]

Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing

Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi- aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 22560–22570, 2023. 3, 7

2023

[5] [5]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 6

2021

[6] [6]

Bifm: Bidirectional flow matching for few-step image editing and generation

Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, and Hongdong Li. Bifm: Bidirectional flow matching for few-step image editing and generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23325–23334, 2026. 3

2026

[7] [7]

Turboedit: Text-based image editing using few-step diffusion models

Gilad Deutch, Rinon Gal, Daniel Garibi, Or Patashnik, and Daniel Cohen-Or. Turboedit: Text-based image editing using few-step diffusion models. InSIGGRAPH Asia 2024 Con- ference Papers, pages 1–12, 2024. 2, 7

2024

[8] [8]

Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021. 2

2021

[9] [9]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning,

[10] [10]

Dit4edit: Dif- fusion transformer for image editing

Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, and Zeyu Wang. Dit4edit: Dif- fusion transformer for image editing. InProceedings of the AAAI Conference on Artificial Intelligence, pages 2969– 2977, 2025. 2

2025

[11] [11]

Instantedit: Text-guided few-step image editing with piecewise rectified flow

Yiming Gong, Zhen Zhu, and Minjia Zhang. Instantedit: Text-guided few-step image editing with piecewise rectified flow. InProceedings of the IEEE/CVF International Confer- ence on Computer Vision, pages 16808–16817, 2025. 7

2025

[12] [12]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- age editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 2

2020

[14] [14]

Paralleledits: Efficient multi-aspect text- driven image editing with attention grouping.Advances in Neural Information Processing Systems, 37:22569–22595,

Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu S Lokhande, and Siwei Lyu. Paralleledits: Efficient multi-aspect text- driven image editing with attention grouping.Advances in Neural Information Processing Systems, 37:22569–22595,

[15] [15]

An edit friendly ddpm noise space: Inversion and manipulations

Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer Michaeli. An edit friendly ddpm noise space: Inversion and manipulations. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12469– 12478, 2024. 2

2024

[16] [16]

Pnp inversion: Boosting diffusion-based editing with 3 lines of code

Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Pnp inversion: Boosting diffusion-based editing with 3 lines of code. InInternational Conference on Learn- ing Representations, pages 23395–23422, 2024. 2, 6, 7

2024

[17] [17]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6007–6017, 2023. 3

2023

[18] Jeongsol Kim, Yeobin Hong, Jonghyun Park, and Jong Chul Ye. Flowalign: Trajectory-regularized, inversion-free flow-based image editing. arXiv preprint arXiv:2505.23145,

[19] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. Flowedit: Inversion-free text-based editing using pre-trained flow models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19721–19730, 2025.

[20] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[21] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

[22] Liangsi Lu, Xuhang Chen, Minzhe Guo, Shichu Li, Jingchao Wang, and Yang Shi. Chordedit: One-step low-energy transport for image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14407, 2026.

[23] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.

[24] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.

[25] Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, and Cuong Pham. Swiftedit: Lightning fast text-guided image editing via one-step diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21492–21501, 2025.

[26] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In International Conference on Learning Representations, pages 1862–1874, 2024.

[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

[30] Dvir Samuel, Barak Meiri, Haggai Maron, Yoad Tewel, Nir Darshan, Shai Avidan, Gal Chechik, and Rami Ben-Ari. Lightning-fast image inversion and editing for text-to-image diffusion models. In International Conference on Learning Representations, pages 38384–38404, 2025.

[31] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer,

[32] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

[33] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

[34] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023.

[35] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.

[36] Cédric Villani et al. Optimal transport: old and new. Springer, 2009.

[37] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[38] Sihan Xu, Yidong Huang, Jiayi Pan, Ziqiao Ma, and Joyce Chai. Inversion-free image editing with language-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9452–9461, 2024.

[39] Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, and Linfeng Zhang. Eedit: Rethinking the spatial and temporal redundancy for efficient image editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17474–17484, 2025.

[40] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

Appendix

A. Implementation Details
B. Proofs and Theoret...


Dividing by $HWC$ gives the result.

Proposition B.4 explains why fixed-$\alpha$ baselines trace a smooth preservation frontier and why even modest decreases in $\alpha$ produce large preservation gains: the penalty grows as $\alpha^2$, while target-alignment gains in $\Phi$ are typically sublinear in $\alpha$.

B.5. Optimality of RRLS over ChordEdit on its own utility

Proposition B.5 (Utility domin...