pith. machine review for the scientific record.

arxiv: 2605.12724 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI

Inline Critic Steers Image Editing

Jason Kuen, Kangning Liu, Mang Tik Chiu, Weitai Kang, Xiaohang Zhan, Yan Yan, Yizhou Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords image editing · inline critic · frozen model · intermediate layers · error pattern · steering generation · diffusion model · refinement token

The pith

A learnable critic token inserted at intermediate layers steers a frozen image-editing model to refine its predictions during the forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that error patterns in image editing are already fixed in early layers of the model, with a rank correlation of 0.83 to the final error map. It introduces Inline Critic, a learnable token that critiques predictions at those layers and adjusts hidden states to steer generation without unfreezing the backbone. This enables correction while computation is still ongoing, rather than after a full image is produced. A sympathetic reader would care because it points to a more efficient way to handle the uneven difficulty of editing different image regions.

Core claim

Although generation capability appears only in the last few layers of a frozen image-editing model, the error pattern is already determined early, with rank correlation 0.83 to the final-layer error map. Inline Critic is a learnable token that critiques the model's intermediate predictions and steers its hidden states to refine the output during the single forward pass. A three-stage training recipe stabilizes the progression from learning to critique to actively steering generation. This produces state-of-the-art scores of 7.89 on GEdit-Bench, a 9.4-point gain on RISEBench over the same backbone, and 81.92 on KRIS-Bench, exceeding GPT-4o. Analyses confirm the token genuinely alters attention and prediction updates in the layers after its insertion.
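To make the mechanism concrete, here is a minimal sketch of the inline-critic idea, assuming a generic stack of transformer blocks. The class, the `insert_layer` argument, and the single-token design are illustrative stand-ins, not the paper's implementation:

```python
# Minimal sketch (not the paper's code): a single learnable token is
# appended to the sequence at one intermediate block of a frozen
# transformer; attention in all later blocks lets image tokens read
# from it, which is where the "steering" happens.
import torch
import torch.nn as nn

class InlineCriticWrapper(nn.Module):
    def __init__(self, blocks: nn.ModuleList, hidden_dim: int, insert_layer: int):
        super().__init__()
        self.blocks = blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)          # backbone stays frozen
        self.insert_layer = insert_layer
        # the only trainable state in this sketch: one token embedding
        self.critic_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden_dim) hidden states entering block 0
        for i, block in enumerate(self.blocks):
            if i == self.insert_layer:
                tok = self.critic_token.expand(h.size(0), -1, -1)
                h = torch.cat([h, tok], dim=1)   # append the critic token
            h = block(h)
        return h[:, :-1, :]                  # drop the critic token at output
```

The property the sketch preserves is that only the token's embedding is trainable: the backbone parameters never update, and any correction must flow through attention in the blocks after the insertion point.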

What carries the argument

Inline Critic, a learnable token that critiques predictions at intermediate layers and steers hidden states

If this is right

  • The critic token alters attention patterns in layers after its insertion.
  • Performance gains appear on region-aware editing benchmarks without retraining the full model.
  • A three-stage training process allows a stable transition from learning critiques to applying steering.
  • Error correction happens inside one forward pass rather than after image completion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-error signal could be tested in text or video generation models that also show late emergence of capability.
  • If the correlation between early and final errors holds across more architectures, it would support lightweight adaptation layers instead of full fine-tuning.
  • This approach might reduce the need for multi-step refinement loops by catching mistakes before they fully form.

Load-bearing premise

The early-layer error pattern can be translated by the critic token into effective steering of later hidden states without destabilizing the frozen backbone.

What would settle it

Insert the critic token and test for a null result: if final edit quality on GEdit-Bench shows no measurable change, or the attention maps in layers after the insertion point remain statistically indistinguishable from the baseline, the central claim fails.
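A hedged sketch of that test, assuming per-sample quality scores and post-insertion attention maps have already been collected for matched runs with and without the token (the function and argument names are hypothetical; `wilcoxon` is SciPy's paired signed-rank test):

```python
# Null test for the inline critic: paired comparison of edit quality plus
# a summary of attention change, on the same inputs run with and without
# the critic token. All inputs are assumed precomputed; this is a sketch,
# not the paper's evaluation protocol.
import numpy as np
from scipy.stats import wilcoxon

def settle(base_scores, critic_scores, base_attn, critic_attn):
    diffs = np.asarray(critic_scores) - np.asarray(base_scores)
    # paired signed-rank test: does inserting the token shift edit quality?
    _, p_quality = wilcoxon(diffs)
    # mean absolute change in attention maps after the insertion layer
    attn_delta = float(np.mean([np.abs(a - b).mean()
                                for a, b in zip(critic_attn, base_attn)]))
    # a large p_quality together with attn_delta near zero would refute it
    return p_quality, attn_delta
```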

Figures

Figures reproduced from arXiv: 2605.12724 by Jason Kuen, Kangning Liu, Mang Tik Chiu, Weitai Kang, Xiaohang Zhan, Yan Yan, Yizhou Wang.

Figure 1: Motivation for inline critic. Top: On a frozen model, we add lightweight heads at intermediate blocks (L6–L54) to probe the generation outputs, with the original final output at L60. Middle: Each head runs its own multi-step denoising to generate an image. Bottom: Based on the final output, we compute an MSE map against the ground truth in each block. After 8x pooling and rank-normalization, we measure the …
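The statistic described in this caption is straightforward to reproduce in outline. A sketch, assuming per-block error maps are available as 2D arrays (the 8x pooling and `spearmanr` follow the caption's description; the function names are ours):

```python
# Sketch of the Figure 1 probing statistic: 8x-pool each block's MSE map,
# then compare an early block's map to the final block's map by Spearman
# rank correlation (rank-normalization is implicit in the rank statistic).
import numpy as np
from scipy.stats import spearmanr

def pool8(err_map: np.ndarray) -> np.ndarray:
    # 8x8 average pooling, cropping any remainder at the edges
    h, w = err_map.shape
    cropped = err_map[:h - h % 8, :w - w % 8]
    return cropped.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

def early_vs_final_rho(early_err: np.ndarray, final_err: np.ndarray) -> float:
    # the paper reports rho = 0.83 between early- and final-layer maps
    rho, _ = spearmanr(pool8(early_err).ravel(), pool8(final_err).ravel())
    return float(rho)
```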
Figure 2: Three-stage training of Inline Critic. Left: Stage 1 trains probe heads at several intermediate layers to predict the generation target from frozen hidden states. Middle: Stage 2 adds a critic token that is trained, at every probed block, to predict the error of Stage-1’s probes; an attention mask hides it from other tokens so the model and the probes are unaffected. Right: Stage 3 removes the mask and tra…
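Read as pseudocode, the staging in this caption looks roughly like the sketch below. The backbone interface (`hidden_states`, `generate`, the `mask_critic` flag) and the loss targets are hypothetical stand-ins; only the three-stage logic is taken from the caption:

```python
# Schematic three-stage recipe from Figure 2, in PyTorch-style pseudocode.
# backbone, probe_heads, and critic are illustrative objects, not the
# paper's code; the backbone is frozen in every stage.
import torch.nn.functional as F

def train_three_stages(backbone, probe_heads, critic, loader, opts):
    backbone.requires_grad_(False)

    # Stage 1: probe heads learn to predict the target from frozen states
    for x, target in loader:
        hidden = backbone.hidden_states(x)          # dict: layer -> states
        loss = sum(F.mse_loss(head(hidden[l]), target)
                   for l, head in probe_heads.items())
        opts[0].zero_grad(); loss.backward(); opts[0].step()

    # Stage 2: the critic token learns to predict the probes' per-position
    # error; an attention mask hides it, so the backbone and the probes
    # are untouched while it learns to critique
    for x, target in loader:
        hidden = backbone.hidden_states(x, critic=critic, mask_critic=True)
        loss = sum(F.mse_loss(critic.predict_error(hidden[l]),
                              (probe_heads[l](hidden[l]) - target)
                                  .pow(2).mean(-1).detach())
                   for l in probe_heads)
        opts[1].zero_grad(); loss.backward(); opts[1].step()

    # Stage 3: remove the mask so the token can steer later hidden states,
    # and train it against the final generation objective
    for x, target in loader:
        out = backbone.generate(x, critic=critic, mask_critic=False)
        loss = F.mse_loss(out, target)
        opts[2].zero_grad(); loss.backward(); opts[2].step()
```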
Figure 3: Heterogeneity results on ImgEdit. Bins are formed by baseline performance (left) and edit type (right). Our method’s improvement concentrates on the hard cases.
Figure 4: Top: per-sample histogram of ρ between the model-side spatial map and the critic’s prediction. Bottom: one sample showing the spatial map (left) and the critic’s prediction (right). …
Figure 5: Qualitative results. We visualize the probe-layer error maps across backbone layers (L6→L54), together with the input image (left) and the final edited result (right). …
read the original abstract

Instruction-based image editing exhibits heterogeneous difficulty not only across cases but also across regions of an image, motivating refinement approaches that allocate correction to where the model struggles. Existing refinement signals arrive late, after a fully generated image or a completed denoising step. We ask whether such a signal can act within an ongoing forward pass. To investigate this, we probe a frozen image-editing model and find that although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation ρ = 0.83 with the final-layer error map). Based on this, we introduce Inline Critic, a learnable token that critiques a frozen model's predictions at its intermediate layers and steers its hidden states to refine generation during the forward pass. A three-stage recipe is proposed to stabilize the training from learning how to critique to steering generation. As a result, we achieve state of the art on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source result on KRIS-Bench (81.92, surpassing GPT-4o). We further provide analyses showing that the critic genuinely shapes the model's attention and prediction updates at subsequent layers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper probes a frozen image-editing model and reports that error patterns are already established in early layers (rank correlation ρ = 0.83 with the final error map) even though generation capability emerges only in the last layers. It introduces Inline Critic, a learnable token that critiques predictions at intermediate layers and steers hidden states during the forward pass. A three-stage training recipe stabilizes the process from critique learning to steering. The method yields SOTA results on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source score on KRIS-Bench (81.92, exceeding GPT-4o), with analyses showing effects on attention and subsequent predictions.

Significance. If the early-error correlation can be reliably translated into targeted steering without destabilizing the frozen backbone, the approach offers a novel inline refinement mechanism that could improve efficiency in transformer-based image editing. The quantified benchmark gains and attention analyses provide concrete support for the premise, though the absence of internal parameter-free derivations or multi-backbone tests limits broader generalizability.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (probing and critic design): the claim that the observed ρ = 0.83 rank correlation between early and final error maps can be directly leveraged by the learnable critic token to produce effective steering lacks supporting quantitative evidence such as pre-/post-intervention error-map correlations or targeted ablations; this assumption is load-bearing for the central performance claims.
  2. [§4] §4 (three-stage recipe): no stability metrics (e.g., hidden-state norm changes or gradient magnitudes on the frozen backbone) or direct comparisons to single-stage training are reported, leaving open the possibility that the recipe is required precisely because direct use of the correlation destabilizes the model.
  3. [§5] §5 (analyses): while attention shaping and prediction updates are demonstrated, the section does not quantify whether these changes reduce error in the regions identified by the early error maps (e.g., via spatial correlation between attention deltas and error reduction), weakening the link between the critic mechanism and the reported gains.
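One plausible instantiation of the check requested in major comment 3, as a hedged sketch (the per-sample attention-delta and error-reduction maps are assumed precomputed; the function name is ours, not a protocol from the paper):

```python
# Spatially correlate where the critic changed attention with where error
# actually fell, per sample, then average. A high mean rho would tie the
# mechanism to the gains; a near-zero one would support the objection.
import numpy as np
from scipy.stats import spearmanr

def attention_error_alignment(attn_delta_maps, error_reduction_maps):
    rhos = []
    for delta, reduction in zip(attn_delta_maps, error_reduction_maps):
        rho, _ = spearmanr(np.abs(delta).ravel(), reduction.ravel())
        rhos.append(rho)
    return float(np.mean(rhos))
```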
minor comments (2)
  1. [Abstract] Abstract: the rank correlation is typeset as “rank correlation r{ho} = 0.83” rather than “rank correlation ρ = 0.83”; standardize the notation to ρ throughout.
  2. [Experiments] Experiments: benchmark scores should include standard deviations across runs or statistical significance tests to substantiate the +9.4 gain and SOTA claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript to incorporate additional quantitative evidence and analyses as requested.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (probing and critic design): the claim that the observed ρ = 0.83 rank correlation between early and final error maps can be directly leveraged by the learnable critic token to produce effective steering lacks supporting quantitative evidence such as pre-/post-intervention error-map correlations or targeted ablations; this assumption is load-bearing for the central performance claims.

    Authors: The observed rank correlation of ρ = 0.83 between early- and final-layer error maps provides the motivation for placing the critic at intermediate layers. The central performance claims are supported by the consistent gains across GEdit-Bench, RISEBench, and KRIS-Bench together with the attention and prediction-update analyses. We nevertheless agree that direct quantitative linkage is desirable and will add pre-/post-intervention error-map correlations as well as targeted ablations in the revised version. revision: yes

  2. Referee: [§4] §4 (three-stage recipe): no stability metrics (e.g., hidden-state norm changes or gradient magnitudes on the frozen backbone) or direct comparisons to single-stage training are reported, leaving open the possibility that the recipe is required precisely because direct use of the correlation destabilizes the model.

    Authors: The three-stage recipe was introduced precisely to ensure stable training when the critic begins to steer hidden states. We accept that stability metrics (hidden-state norm changes, gradient magnitudes on the frozen backbone) and explicit single-stage comparisons would strengthen the presentation and will include them in the revision. revision: yes

  3. Referee: [§5] §5 (analyses): while attention shaping and prediction updates are demonstrated, the section does not quantify whether these changes reduce error in the regions identified by the early error maps (e.g., via spatial correlation between attention deltas and error reduction), weakening the link between the critic mechanism and the reported gains.

    Authors: The analyses already demonstrate that the critic alters attention patterns and subsequent predictions. To make the connection to regional error reduction explicit, we will add spatial-correlation measurements between attention deltas and error-map reductions in the regions flagged by the early-layer error maps. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probing and staged training yield benchmark gains without self-referential reduction

full rationale

The paper's chain consists of an empirical observation (early-layer error pattern with ρ=0.83 correlation to final error map, obtained by probing a frozen backbone) followed by introduction of a learnable critic token and a three-stage training procedure to enable steering. These steps do not reduce to their inputs by construction: the correlation is measured externally, the critic parameters are optimized against benchmark objectives, and performance claims rest on independent evaluations (GEdit-Bench, RISEBench, KRIS-Bench) rather than any fitted quantity being renamed as a prediction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no derivation equates the output to the input definition. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method rests on the empirical observation that error patterns stabilize early and on the assumption that a critic token can be trained to act on them; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption Error pattern observed in early layers correlates strongly (ρ = 0.83) with final-layer error map
    Derived from probing experiments on the frozen model
invented entities (1)
  • Inline Critic (learnable token) · no independent evidence
    purpose: Critique intermediate predictions and steer hidden states during the forward pass
    New component introduced to enable mid-generation refinement

pith-pipeline@v0.9.0 · 5531 in / 1355 out tokens · 50102 ms · 2026-05-14T20:48:07.135910+00:00 · methodology



    Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, and Hongsheng Li. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15329–15339, October 2025. U...