pith. machine review for the scientific record.

arxiv: 2605.12724 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI

Inline Critic Steers Image Editing

Jason Kuen, Kangning Liu, Mang Tik Chiu, Weitai Kang, Xiaohang Zhan, Yan Yan, Yizhou Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords image editing · inline critic · frozen model · intermediate layers · error pattern · steering generation · diffusion model · refinement token

The pith

A learnable critic token inserted at intermediate layers steers a frozen image-editing model to refine its predictions during the forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that error patterns in image editing are already fixed in early layers of the model, with a rank correlation of 0.83 to the final error map. It introduces Inline Critic, a learnable token that critiques predictions at those layers and adjusts hidden states to steer generation without unfreezing the backbone. This enables correction while computation is still ongoing, rather than after a full image is produced. A sympathetic reader would care because it points to a more efficient way to handle the uneven difficulty of editing different image regions.

Core claim

Although generation capability appears only in the last few layers of a frozen image-editing model, the error pattern is already determined early, with rank correlation 0.83 to the final-layer error map. Inline Critic is a learnable token that critiques the model's intermediate predictions and steers its hidden states to refine the output during the single forward pass. A three-stage training recipe stabilizes the progression from learning to critique to actively steering generation. This produces state-of-the-art scores of 7.89 on GEdit-Bench, a 9.4-point gain on RISEBench over the same backbone, and 81.92 on KRIS-Bench, exceeding GPT-4o. Analyses confirm the token genuinely alters attention and prediction updates in the layers after its insertion.
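To make the mechanism concrete, here is a minimal sketch of the inline-critic idea, assuming a generic stack of transformer blocks. The class, the `insert_layer` argument, and the single-token design are illustrative stand-ins, not the paper's implementation:

```python
# Minimal sketch (not the paper's code): a single learnable token is
# appended to the sequence at one intermediate block of a frozen
# transformer; attention in all later blocks lets image tokens read
# from it, which is where the "steering" happens.
import torch
import torch.nn as nn

class InlineCriticWrapper(nn.Module):
    def __init__(self, blocks: nn.ModuleList, hidden_dim: int, insert_layer: int):
        super().__init__()
        self.blocks = blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)          # backbone stays frozen
        self.insert_layer = insert_layer
        # the only trainable state in this sketch: one token embedding
        self.critic_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden_dim) hidden states entering block 0
        for i, block in enumerate(self.blocks):
            if i == self.insert_layer:
                tok = self.critic_token.expand(h.size(0), -1, -1)
                h = torch.cat([h, tok], dim=1)   # append the critic token
            h = block(h)
        return h[:, :-1, :]                  # drop the critic token at output
```

The property the sketch preserves is that only the token's embedding is trainable: the backbone parameters never update, and any correction must flow through attention in the blocks after the insertion point.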

What carries the argument

Inline Critic, a learnable token that critiques predictions at intermediate layers and steers hidden states

If this is right

  • The critic token alters attention patterns in layers after its insertion.
  • Performance gains appear on region-aware editing benchmarks without retraining the full model.
  • A three-stage training process allows a stable transition from learning critiques to applying steering.
  • Error correction happens inside one forward pass rather than after image completion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-error signal could be tested in text or video generation models that also show late emergence of capability.
  • If the correlation between early and final errors holds across more architectures, it would support lightweight adaptation layers instead of full fine-tuning.
  • This approach might reduce the need for multi-step refinement loops by catching mistakes before they fully form.

Load-bearing premise

The early-layer error pattern can be translated by the critic token into effective steering of later hidden states without destabilizing the frozen backbone.

What would settle it

Insert the critic token and test for a null result: if final edit quality on GEdit-Bench shows no measurable change, or the attention maps in layers after the insertion point remain statistically indistinguishable from the baseline, the central claim fails.
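A hedged sketch of that test, assuming per-sample quality scores and post-insertion attention maps have already been collected for matched runs with and without the token (the function and argument names are hypothetical; `wilcoxon` is SciPy's paired signed-rank test):

```python
# Null test for the inline critic: paired comparison of edit quality plus
# a summary of attention change, on the same inputs run with and without
# the critic token. All inputs are assumed precomputed; this is a sketch,
# not the paper's evaluation protocol.
import numpy as np
from scipy.stats import wilcoxon

def settle(base_scores, critic_scores, base_attn, critic_attn):
    diffs = np.asarray(critic_scores) - np.asarray(base_scores)
    # paired signed-rank test: does inserting the token shift edit quality?
    _, p_quality = wilcoxon(diffs)
    # mean absolute change in attention maps after the insertion layer
    attn_delta = float(np.mean([np.abs(a - b).mean()
                                for a, b in zip(critic_attn, base_attn)]))
    # a large p_quality together with attn_delta near zero would refute it
    return p_quality, attn_delta
```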

Figures

Figures reproduced from arXiv: 2605.12724 by Jason Kuen, Kangning Liu, Mang Tik Chiu, Weitai Kang, Xiaohang Zhan, Yan Yan, Yizhou Wang.

Figure 1: Motivation for inline critic. Top: On a frozen model, we add lightweight heads at intermediate blocks (L6–L54) to probe the generation outputs, with the original final output at L60. Middle: Each head runs its own multi-step denoising to generate an image. Bottom: Based on the final output, we compute an MSE map against the ground truth in each block. After 8x pooling and rank-normalization, we measure the …
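The statistic described in this caption is straightforward to reproduce in outline. A sketch, assuming per-block error maps are available as 2D arrays (the 8x pooling and `spearmanr` follow the caption's description; the function names are ours):

```python
# Sketch of the Figure 1 probing statistic: 8x-pool each block's MSE map,
# then compare an early block's map to the final block's map by Spearman
# rank correlation (rank-normalization is implicit in the rank statistic).
import numpy as np
from scipy.stats import spearmanr

def pool8(err_map: np.ndarray) -> np.ndarray:
    # 8x8 average pooling, cropping any remainder at the edges
    h, w = err_map.shape
    cropped = err_map[:h - h % 8, :w - w % 8]
    return cropped.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

def early_vs_final_rho(early_err: np.ndarray, final_err: np.ndarray) -> float:
    # the paper reports rho = 0.83 between early- and final-layer maps
    rho, _ = spearmanr(pool8(early_err).ravel(), pool8(final_err).ravel())
    return float(rho)
```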
Figure 2: Three-stage training of Inline Critic. Left: Stage 1 trains probe heads at several intermediate layers to predict the generation target from frozen hidden states. Middle: Stage 2 adds a critic token that is trained, at every probed block, to predict the error of Stage-1’s probes; an attention mask hides it from other tokens so the model and the probes are unaffected. Right: Stage 3 removes the mask and tra…
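Read as pseudocode, the staging in this caption looks roughly like the sketch below. The backbone interface (`hidden_states`, `generate`, the `mask_critic` flag) and the loss targets are hypothetical stand-ins; only the three-stage logic is taken from the caption:

```python
# Schematic three-stage recipe from Figure 2, in PyTorch-style pseudocode.
# backbone, probe_heads, and critic are illustrative objects, not the
# paper's code; the backbone is frozen in every stage.
import torch.nn.functional as F

def train_three_stages(backbone, probe_heads, critic, loader, opts):
    backbone.requires_grad_(False)

    # Stage 1: probe heads learn to predict the target from frozen states
    for x, target in loader:
        hidden = backbone.hidden_states(x)          # dict: layer -> states
        loss = sum(F.mse_loss(head(hidden[l]), target)
                   for l, head in probe_heads.items())
        opts[0].zero_grad(); loss.backward(); opts[0].step()

    # Stage 2: the critic token learns to predict the probes' per-position
    # error; an attention mask hides it, so the backbone and the probes
    # are untouched while it learns to critique
    for x, target in loader:
        hidden = backbone.hidden_states(x, critic=critic, mask_critic=True)
        loss = sum(F.mse_loss(critic.predict_error(hidden[l]),
                              (probe_heads[l](hidden[l]) - target)
                                  .pow(2).mean(-1).detach())
                   for l in probe_heads)
        opts[1].zero_grad(); loss.backward(); opts[1].step()

    # Stage 3: remove the mask so the token can steer later hidden states,
    # and train it against the final generation objective
    for x, target in loader:
        out = backbone.generate(x, critic=critic, mask_critic=False)
        loss = F.mse_loss(out, target)
        opts[2].zero_grad(); loss.backward(); opts[2].step()
```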
Figure 3: Heterogeneity results on ImgEdit. Bins are formed by baseline performance (left) and edit type (right). Our method’s improvement concentrates on the hard cases.
Figure 4: Top: per-sample histogram of ρ between the model-side spatial map and the critic’s prediction. Bottom: one sample showing the spatial map (left) and the critic’s prediction (right). …
Figure 5: Qualitative results. We visualize the probe-layer error maps across backbone layers (L6→L54), together with the input image (left) and the final edited result (right). …
read the original abstract

Instruction-based image editing exhibits heterogeneous difficulty not only across cases but also across regions of an image, motivating refinement approaches that allocate correction to where the model struggles. Existing refinement signals arrive late, after a fully generated image or a completed denoising step. We ask whether such a signal can act within an ongoing forward pass. To investigate this, we probe a frozen image-editing model and find that although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation ρ = 0.83 with the final-layer error map). Based on this, we introduce Inline Critic, a learnable token that critiques a frozen model's predictions at its intermediate layers and steers its hidden states to refine generation during the forward pass. A three-stage recipe is proposed to stabilize the training from learning how to critique to steering generation. As a result, we achieve state of the art on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source result on KRIS-Bench (81.92, surpassing GPT-4o). We further provide analyses showing that the critic genuinely shapes the model's attention and prediction updates at subsequent layers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper probes a frozen image-editing model and reports that error patterns are already established in early layers (rank correlation ρ = 0.83 with the final error map) even though generation capability emerges only in the last layers. It introduces Inline Critic, a learnable token that critiques predictions at intermediate layers and steers hidden states during the forward pass. A three-stage training recipe stabilizes the process from critique learning to steering. The method yields SOTA results on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source score on KRIS-Bench (81.92, exceeding GPT-4o), with analyses showing effects on attention and subsequent predictions.

Significance. If the early-error correlation can be reliably translated into targeted steering without destabilizing the frozen backbone, the approach offers a novel inline refinement mechanism that could improve efficiency in transformer-based image editing. The quantified benchmark gains and attention analyses provide concrete support for the premise, though the absence of internal parameter-free derivations or multi-backbone tests limits broader generalizability.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (probing and critic design): the claim that the observed ρ = 0.83 rank correlation between early and final error maps can be directly leveraged by the learnable critic token to produce effective steering lacks supporting quantitative evidence such as pre-/post-intervention error-map correlations or targeted ablations; this assumption is load-bearing for the central performance claims.
  2. [§4] §4 (three-stage recipe): no stability metrics (e.g., hidden-state norm changes or gradient magnitudes on the frozen backbone) or direct comparisons to single-stage training are reported, leaving open the possibility that the recipe is required precisely because direct use of the correlation destabilizes the model.
  3. [§5] §5 (analyses): while attention shaping and prediction updates are demonstrated, the section does not quantify whether these changes reduce error in the regions identified by the early error maps (e.g., via spatial correlation between attention deltas and error reduction), weakening the link between the critic mechanism and the reported gains.
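One plausible instantiation of the check requested in major comment 3, as a hedged sketch (the per-sample attention-delta and error-reduction maps are assumed precomputed; the function name is ours, not a protocol from the paper):

```python
# Spatially correlate where the critic changed attention with where error
# actually fell, per sample, then average. A high mean rho would tie the
# mechanism to the gains; a near-zero one would support the objection.
import numpy as np
from scipy.stats import spearmanr

def attention_error_alignment(attn_delta_maps, error_reduction_maps):
    rhos = []
    for delta, reduction in zip(attn_delta_maps, error_reduction_maps):
        rho, _ = spearmanr(np.abs(delta).ravel(), reduction.ravel())
        rhos.append(rho)
    return float(np.mean(rhos))
```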
minor comments (2)
  1. [Abstract] Abstract: the rank correlation is typeset as “rank correlation r{ho} = 0.83” rather than “rank correlation ρ = 0.83”; standardize the notation to ρ throughout.
  2. [Experiments] Experiments: benchmark scores should include standard deviations across runs or statistical significance tests to substantiate the +9.4 gain and SOTA claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript to incorporate additional quantitative evidence and analyses as requested.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (probing and critic design): the claim that the observed ρ = 0.83 rank correlation between early and final error maps can be directly leveraged by the learnable critic token to produce effective steering lacks supporting quantitative evidence such as pre-/post-intervention error-map correlations or targeted ablations; this assumption is load-bearing for the central performance claims.

    Authors: The observed rank correlation of ρ = 0.83 between early- and final-layer error maps provides the motivation for placing the critic at intermediate layers. The central performance claims are supported by the consistent gains across GEdit-Bench, RISEBench, and KRIS-Bench together with the attention and prediction-update analyses. We nevertheless agree that direct quantitative linkage is desirable and will add pre-/post-intervention error-map correlations as well as targeted ablations in the revised version. revision: yes

  2. Referee: [§4] §4 (three-stage recipe): no stability metrics (e.g., hidden-state norm changes or gradient magnitudes on the frozen backbone) or direct comparisons to single-stage training are reported, leaving open the possibility that the recipe is required precisely because direct use of the correlation destabilizes the model.

    Authors: The three-stage recipe was introduced precisely to ensure stable training when the critic begins to steer hidden states. We accept that stability metrics (hidden-state norm changes, gradient magnitudes on the frozen backbone) and explicit single-stage comparisons would strengthen the presentation and will include them in the revision. revision: yes

  3. Referee: [§5] §5 (analyses): while attention shaping and prediction updates are demonstrated, the section does not quantify whether these changes reduce error in the regions identified by the early error maps (e.g., via spatial correlation between attention deltas and error reduction), weakening the link between the critic mechanism and the reported gains.

    Authors: The analyses already demonstrate that the critic alters attention patterns and subsequent predictions. To make the connection to regional error reduction explicit, we will add spatial-correlation measurements between attention deltas and error-map reductions in the regions flagged by the early-layer error maps. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probing and staged training yield benchmark gains without self-referential reduction

full rationale

The paper's chain consists of an empirical observation (early-layer error pattern with ρ=0.83 correlation to final error map, obtained by probing a frozen backbone) followed by introduction of a learnable critic token and a three-stage training procedure to enable steering. These steps do not reduce to their inputs by construction: the correlation is measured externally, the critic parameters are optimized against benchmark objectives, and performance claims rest on independent evaluations (GEdit-Bench, RISEBench, KRIS-Bench) rather than any fitted quantity being renamed as a prediction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no derivation equates the output to the input definition. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method rests on the empirical observation that error patterns stabilize early and on the assumption that a critic token can be trained to act on them; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption Error pattern observed in early layers correlates strongly (ρ = 0.83) with final-layer error map
    Derived from probing experiments on the frozen model
invented entities (1)
  • Inline Critic (learnable token) · no independent evidence
    purpose: Critique intermediate predictions and steer hidden states during the forward pass
    New component introduced to enable mid-generation refinement

pith-pipeline@v0.9.0 · 5531 in / 1355 out tokens · 50102 ms · 2026-05-14T20:48:07.135910+00:00 · methodology



    Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, and Hongsheng Li. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15329–15339, October 2025. U...