Recognition: 1 Lean theorem link
Inline Critic Steers Image Editing
Pith reviewed 2026-05-14 20:48 UTC · model grok-4.3
The pith
A learnable critic token inserted at intermediate layers steers a frozen image-editing model to refine its predictions during the forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Although generation capability appears only in the last few layers of a frozen image-editing model, the error pattern is already determined early, with rank correlation ρ = 0.83 to the final-layer error map. Inline Critic is a learnable token that critiques the model's intermediate predictions and steers its hidden states to refine the output during a single forward pass. A three-stage training recipe stabilizes the transition from learning to critique to applying steering. This produces state-of-the-art scores of 7.89 on GEdit-Bench, a +9.4 gain on RISEBench over the same backbone, and 81.92 on KRIS-Bench, exceeding GPT-4o. Analyses confirm the token genuinely alters attention and prediction updates at subsequent layers.
What carries the argument
Inline Critic, a learnable token that critiques predictions at intermediate layers and steers hidden states
If this is right
- The critic token alters attention patterns in layers after its insertion.
- Performance gains appear on region-aware editing benchmarks without retraining the full model.
- A three-stage training process allows a stable transition from learning critiques to applying steering.
- Error correction happens inside one forward pass rather than after image completion.
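The probing claim behind all of this is a rank correlation between early- and final-layer error maps. A minimal pure-Python sketch of that measurement; the function names and toy error maps here are hypothetical, not taken from the paper:

```python
def rank(xs):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation of two equal-length sequences."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Toy per-region error maps, flattened to 1-D (hypothetical numbers):
early_errors = [0.9, 0.1, 0.6, 0.2, 0.8]
final_errors = [0.7, 0.2, 0.5, 0.1, 0.9]
rho = spearman(early_errors, final_errors)  # → 0.8 for these toy maps
```

A value near the paper's reported 0.83 would indicate that the regions the model will get wrong are already identifiable at the probed early layer.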
Where Pith is reading between the lines
- The same early-error signal could be tested in text or video generation models that also show late emergence of capability.
- If the correlation between early and final errors holds across more architectures, it would support lightweight adaptation layers instead of full fine-tuning.
- This approach might reduce the need for multi-step refinement loops by catching mistakes before they fully form.
Load-bearing premise
The early-layer error pattern can be translated by the critic token into effective steering of later hidden states without destabilizing the frozen backbone.
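The premise can be pictured as a token inserted mid-stack that nudges every later hidden state while the backbone's weights never change. A toy stand-in; the layer count, insertion point, and steering rule are all illustrative assumptions, not the paper's actual mechanism:

```python
FROZEN_LAYERS = 8   # depth of the (hypothetical) frozen stack
INSERT_AT = 3       # intermediate layer where the critic token is inserted

def frozen_layer(h):
    # Stand-in for a frozen transformer block: a fixed, weight-free update.
    return [0.9 * x for x in h]

def critic_steer(h, critic):
    # Stand-in for critique-and-steer: nudge the hidden state toward the
    # critic vector without touching the frozen layer above.
    return [x + 0.1 * (c - x) for x, c in zip(h, critic)]

def forward(h, critic=None):
    for layer in range(FROZEN_LAYERS):
        h = frozen_layer(h)
        if critic is not None and layer >= INSERT_AT:
            h = critic_steer(h, critic)
    return h

baseline = forward([1.0, -1.0])
steered = forward([1.0, -1.0], critic=[0.5, 0.5])
```

The point of the sketch is only structural: the steering acts inside the single forward pass, downstream of the insertion layer, and the frozen blocks themselves are never modified.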
What would settle it
The claim would be undermined if inserting the critic token produced no measurable change in final edit quality on GEdit-Bench, or left later-layer attention maps statistically indistinguishable from the baseline.
read the original abstract
Instruction-based image editing exhibits heterogeneous difficulty not only across cases but also across regions of an image, motivating refinement approaches that allocate correction to where the model struggles. Existing refinement signals arrive late, after a fully generated image or a completed denoising step. We ask whether such a signal can act within an ongoing forward pass. To investigate this, we probe a frozen image-editing model and find that although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation ρ = 0.83 with the final-layer error map). Based on this, we introduce Inline Critic, a learnable token that critiques a frozen model's predictions at its intermediate layers and steers its hidden states to refine generation during the forward pass. A three-stage recipe is proposed to stabilize the training from learning how to critique to steering generation. As a result, we achieve state of the art on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source result on KRIS-Bench (81.92, surpassing GPT-4o). We further provide analyses showing that the critic genuinely shapes the model's attention and prediction updates at subsequent layers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper probes a frozen image-editing model and reports that error patterns are already established in early layers (rank correlation ρ = 0.83 with the final error map) even though generation capability emerges only in the last layers. It introduces Inline Critic, a learnable token that critiques predictions at intermediate layers and steers hidden states during the forward pass. A three-stage training recipe stabilizes the process from critique learning to steering. The method yields SOTA results on GEdit-Bench (7.89), a +9.4 gain on RISEBench over the same backbone, and the strongest open-source score on KRIS-Bench (81.92, exceeding GPT-4o), with analyses showing effects on attention and subsequent predictions.
Significance. If the early-error correlation can be reliably translated into targeted steering without destabilizing the frozen backbone, the approach offers a novel inline refinement mechanism that could improve efficiency in transformer-based image editing. The quantified benchmark gains and attention analyses provide concrete support for the premise, though the absence of internal parameter-free derivations or multi-backbone tests limits broader generalizability.
major comments (3)
- [Abstract and §3] Abstract and §3 (probing and critic design): the claim that the observed ρ = 0.83 rank correlation between early and final error maps can be directly leveraged by the learnable critic token to produce effective steering lacks supporting quantitative evidence such as pre-/post-intervention error-map correlations or targeted ablations; this assumption is load-bearing for the central performance claims.
- [§4] §4 (three-stage recipe): no stability metrics (e.g., hidden-state norm changes or gradient magnitudes on the frozen backbone) or direct comparisons to single-stage training are reported, leaving open the possibility that the recipe is required precisely because direct use of the correlation destabilizes the model.
- [§5] §5 (analyses): while attention shaping and prediction updates are demonstrated, the section does not quantify whether these changes reduce error in the regions identified by the early error maps (e.g., via spatial correlation between attention deltas and error reduction), weakening the link between the critic mechanism and the reported gains.
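The stability metrics asked for in the second comment could be reported with something as simple as a relative hidden-state norm change before and after steering. A hypothetical diagnostic sketch (names and numbers are illustrative only):

```python
def l2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return sum(x * x for x in v) ** 0.5

def relative_norm_change(h_base, h_steered):
    """||h_steered - h_base|| / ||h_base||: how far steering moves a state."""
    diff = [a - b for a, b in zip(h_steered, h_base)]
    return l2(diff) / l2(h_base)

# Hypothetical hidden states at one layer, before/after critic steering:
h_base = [3.0, 4.0]
h_steered = [3.3, 4.4]  # a uniform 10% scale-up
change = relative_norm_change(h_base, h_steered)  # ≈ 0.1
```

Reporting this quantity per layer and per training stage would show directly whether steering stays in a regime the frozen backbone can absorb.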
minor comments (2)
- [Abstract] Abstract: the notation “rank correlation ρ = 0.83” is written as “rank correlation r{ho} = 0.83”; standardize to ρ throughout.
- [Experiments] Experiments: benchmark scores should include standard deviations across runs or statistical significance tests to substantiate the +9.4 gain and SOTA claims.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript to incorporate additional quantitative evidence and analyses as requested.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (probing and critic design): the claim that the observed ρ = 0.83 rank correlation between early and final error maps can be directly leveraged by the learnable critic token to produce effective steering lacks supporting quantitative evidence such as pre-/post-intervention error-map correlations or targeted ablations; this assumption is load-bearing for the central performance claims.
Authors: The observed rank correlation of ρ = 0.83 between early- and final-layer error maps provides the motivation for placing the critic at intermediate layers. The central performance claims are supported by the consistent gains across GEdit-Bench, RISEBench, and KRIS-Bench together with the attention and prediction-update analyses. We nevertheless agree that direct quantitative linkage is desirable and will add pre-/post-intervention error-map correlations as well as targeted ablations in the revised version. revision: yes
-
Referee: [§4] §4 (three-stage recipe): no stability metrics (e.g., hidden-state norm changes or gradient magnitudes on the frozen backbone) or direct comparisons to single-stage training are reported, leaving open the possibility that the recipe is required precisely because direct use of the correlation destabilizes the model.
Authors: The three-stage recipe was introduced precisely to ensure stable training when the critic begins to steer hidden states. We accept that stability metrics (hidden-state norm changes, gradient magnitudes on the frozen backbone) and explicit single-stage comparisons would strengthen the presentation and will include them in the revision. revision: yes
-
Referee: [§5] §5 (analyses): while attention shaping and prediction updates are demonstrated, the section does not quantify whether these changes reduce error in the regions identified by the early error maps (e.g., via spatial correlation between attention deltas and error reduction), weakening the link between the critic mechanism and the reported gains.
Authors: The analyses already demonstrate that the critic alters attention patterns and subsequent predictions. To make the connection to regional error reduction explicit, we will add spatial-correlation measurements between attention deltas and error-map reductions in the regions flagged by the early-layer error maps. revision: yes
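The spatial-correlation measurement the authors promise could take the form of a Pearson correlation between per-region attention deltas and per-region error reductions. A hypothetical sketch with made-up measurements:

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Per-region |attention change| after inserting the critic, and the error
# reduction in the same regions (both entirely hypothetical numbers):
attention_delta = [0.80, 0.10, 0.50, 0.05]
error_reduction = [0.60, 0.05, 0.40, 0.02]
r = pearson(attention_delta, error_reduction)  # strongly positive here
```

A high value of `r` would tie the mechanism to the gains: the regions where the critic redirects attention are the regions where error actually drops.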
Circularity Check
No circularity: empirical probing and staged training yield benchmark gains without self-referential reduction
full rationale
The paper's chain consists of an empirical observation (early-layer error pattern with ρ=0.83 correlation to final error map, obtained by probing a frozen backbone) followed by introduction of a learnable critic token and a three-stage training procedure to enable steering. These steps do not reduce to their inputs by construction: the correlation is measured externally, the critic parameters are optimized against benchmark objectives, and performance claims rest on independent evaluations (GEdit-Bench, RISEBench, KRIS-Bench) rather than any fitted quantity being renamed as a prediction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no derivation equates the output to the input definition. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the error pattern observed in early layers correlates strongly (ρ = 0.83) with the final-layer error map
invented entities (1)
- Inline Critic learnable token (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · match: unclear · "although generation capability emerges only in the last few layers, the error pattern is already set in early layers (rank correlation ρ = 0.83 with the final-layer error map)"
Reference graph
Works this paper leans on
-
[1]
Self-rectifying diffusion sampling with perturbed-attention guidance
Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. In Computer Vision – ECCV 2024, pages 1–17, 2024. doi: 10.1007/978-3-031-73464-9_1. URL https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/09184.pdf
-
[2]
A neural space-time representation for text-to-image personalization
Yuval Alaluf, Elad Richardson, Gal Metzer, and Daniel Cohen-Or. A neural space-time representation for text-to-image personalization. ACM Transactions on Graphics (SIGGRAPH Asia), 42(6):243:1–243:10, 2023. doi: 10.1145/3618322. URL https://doi.org/10.1145/3618322
-
[5]
URL https://arxiv.org/abs/2502.13923
-
[6]
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image ...
2025
-
[7]
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11315–11325, June 2022. URL https://openaccess.thecvf.com/content/CVPR2022/html/Chang_MaskGIT_Masked_Generative_Image_Transformer_CVPR_2022_paper.html
2022
-
[8]
Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (SIGGRAPH), 42(4):1–10, 2023. doi: 10.1145/3592116. URL https://doi.org/10.1145/3592116
-
[9]
ShareGPT-4o-Image: Aligning multimodal models with GPT-4o-level image generation
Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. ShareGPT-4o-Image: Aligning multimodal models with GPT-4o-level image generation. arXiv preprint arXiv:2506.18095, 2025. URL https://arxiv.org/abs/2506.18095
-
[10]
Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, and Yi-Fan Zhang. OpenGPT-4o-Image: A comprehensive dataset for advanced image generation and editing. arXiv preprint arXiv:2509.24900, 2025. URL https://arxiv.org/abs/2509.24900
-
[11]
Diffusion model guided sampling with pixel-wise aleatoric uncertainty estimation
Michele De Vita and Vasileios Belagiannis. Diffusion model guided sampling with pixel-wise aleatoric uncertainty estimation. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pages 3844–3854, February 2025. URL https://openaccess.thecvf.com/content/WACV2025/html/De_Vita_Diffusion_Model_Guided_Sampling_with_Pixel-Wise_A...
2025
-
[12]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. URL https://arxiv.org/abs/2505.14683
2025
-
[13]
Ziyi Dong, Pengxu Wei, and Liang Lin. DreamArtist++: Controllable one-shot text-to-image generation via positive-negative adapter. arXiv preprint arXiv:2211.11337, 2022. URL https://arxiv.org/abs/2211.11337
-
[14]
Diffusion self-guidance for controllable image generation
Dave Epstein, Allan Jabri, Ben Poole, Alexei Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 16222–16239. Curran Associates, Inc., 2023. URL https://papers.nips.cc/paper_files/paper/2023/hash/3469b211b829b39d2b0cfd3b880a869c-Abstra...
2023
-
[15]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning, volu...
2024
-
[16]
An image is worth one word: Personalizing text-to-image generation using textual inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=NAQvF08TcyG
2023
-
[17]
Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, and Jiaqi Wang. UniREditBench: A unified reasoning-based image editing benchmark. arXiv preprint arXiv:2511.01295, 2025. URL https://arxiv.org/abs/2511.01295
-
[18]
Ziqi Huang, Tianxing Wu, Yuming Jiang, Kelvin C. K. Chan, and Ziwei Liu. ReVersion: Diffusion-based relation inversion from images. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024. doi: 10.1145/3680528.3687658. URL https://ziqihuangg.github.io/projects/reversion
-
[19]
HQ-Edit: A high-quality dataset for instruction-based image editing
Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Cihang Xie, and Yuyin Zhou. HQ-Edit: A high-quality dataset for instruction-based image editing. In International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=mZptYYttFj
2025
-
[20]
Experiment with Gemini 2.0 Flash native image generation
Kat Kampf and Nicole Brichtova. Experiment with Gemini 2.0 Flash native image generation. Google Developers Blog, 2025. URL https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation
2025
-
[21]
Guiding a diffusion model with a bad version of itself
Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 52996–53021. Curran Associates, Inc., 2024. doi: 10.52202/079017-1679. URL https://proceedings.neurips.cc/paper_files/paper...
2024
-
[22]
BayesDiff: Estimating pixel-wise uncertainty in diffusion via bayesian inference
Siqi Kou, Lei Gan, Dequan Wang, Chongxuan Li, and Zhijie Deng. BayesDiff: Estimating pixel-wise uncertainty in diffusion via bayesian inference. In International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=YcM6ofShwY
2024
-
[23]
Feedback guidance of diffusion models
Felix Koulischer, Florian Handke, Johannes Deleu, Thomas Demeester, and Luca Ambrogioni. Feedback guidance of diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://openreview.net/forum?id=8ySOcf7UpM
2025
-
[24]
Improved masked image generation with token-critic
José Lezama, Huiwen Chang, Lu Jiang, and Irfan Essa. Improved masked image generation with token-critic. In Computer Vision – ECCV 2022, pages 70–86, 2022. doi: 10.1007/978-3-031-20050-2_5. URL https://www.ecva.net/papers/eccv_2022/papers_ECCV/html/2901_ECCV_2022_paper.php
-
[25]
Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, and Deng Cai. ThinkRL-Edit: Thinking in reinforcement learning for reasoning-centric image editing. arXiv preprint arXiv:2601.03467, 2026. URL https://arxiv.org/abs/2601.03467
-
[26]
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-DiT: Inference-time scaling for text-to-image diffusion transformers via in-context reflection. arXiv preprint arXiv:2503.12271, 2025. URL https://arxiv.org/abs/2503.12271
-
[27]
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, and Li Yuan. Uniworld-V2: Reinforce image editing with diffusion negative-aware finetuning and MLLM implicit feedback. arXiv preprint arXiv:2510.16888, 2025. URL https://arxiv.org/abs/2510.16888
-
[28]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t
2023
-
[30]
URLhttps://arxiv.org/abs/2504.17761
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=XVjTT1nw5z
2023
-
[32]
EditScore: Unlocking online RL for image editing via high-fidelity reward modeling
Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, and Zheng Liu. EditScore: Unlocking online RL for image editing via high-fidelity reward modeling. In International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=E7YpL4L4Xh
2026
-
[33]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement ...
-
[34]
Introducing 4o image generation
OpenAI. Introducing 4o image generation. OpenAI product release, 2025. URL https://openai.com/index/introducing-4o-image-generation/
2025
-
[35]
GPT Image 1
OpenAI. GPT Image 1. OpenAI API model card, 2025. URL https://platform.openai.com/docs/models/gpt-image-1
2025
-
[36]
GPT Image 1.5
OpenAI. GPT Image 1.5. OpenAI API model card, 2025. URL https://platform.openai.com/docs/models/gpt-image-1.5
2025
-
[37]
GPT Image 2
OpenAI. GPT Image 2. OpenAI API model card, 2026. URL https://platform.openai.com/docs/models/gpt-image-2
2026
-
[38]
Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-Banana-400K: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808, 2025. URL https://arxiv.org/abs/2510.19808
-
[39]
Uni-CoT: Towards unified chain-of-thought reasoning across text and vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Haoyu Pan, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-CoT: Towards unified chain-of-thought reasoning across text and vision. In International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=5nevWRoNjn
2026
-
[40]
Qwen3.5-27B
Qwen Team. Qwen3.5-27B. Hugging Face model card, 2026. URL https://huggingface.co/Qwen/Qwen3.5-27B
2026
-
[41]
Introducing nano banana pro
Naina Raisinghani. Introducing nano banana pro. Google Blog; introduces Gemini 3 Pro Image, 2025. URL https://blog.google/innovation-and-ai/products/nano-banana-pro
2025
-
[42]
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Johannes Schusterbauer, Ming Gui, Yusong Li, Pingchuan Ma, Felix Krause, and Björn Ommer. Denoising, fast and slow: Difficulty-aware adaptive sampling for image generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. URL https://arxiv.org/abs/2604.19141
2026
-
[43]
Image editing in Gemini just got a major upgrade
David Sharon and Nicole Brichtova. Image editing in Gemini just got a major upgrade. Google Blog; introduces Gemini 2.5 Flash Image / Nano Banana, 2025. URL https://blog.google/products/gemini/updated-image-editing-model
2025
-
[44]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...
2025
-
[45]
P+: Extended textual conditioning in text-to-image generation.arXiv preprint arXiv:2303.09522, 2023
Andrey V oynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. P+: Extended textual conditioning in text-to-image generation.arXiv preprint arXiv:2303.09522, 2023. URL https://arxiv.org/abs/ 2303.09522
- [47]
-
[48]
OmniEdit: Building image editing generalist models through specialist supervision
Cong Wei, Zheyang Xiong, Weiming Ren, Xeron Du, Ge Zhang, and Wenhu Chen. OmniEdit: Building image editing generalist models through specialist supervision. In International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=Hlm0cga0sv
2025
-
[49]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...
2025
-
[50]
KRIS-Bench: Benchmarking next-level intelligent image editing models
Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, and Xu Yang. KRIS-Bench: Benchmarking next-level intelligent image editing models. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=aWSh1Ec64T
2025
-
[51]
ImgEdit: A unified image editing dataset and benchmark
Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. ImgEdit: A unified image editing dataset and benchmark. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=uUCSrMlfD3
2025
-
[52]
Nano-consistent-150k
Yejy53. Nano-consistent-150k. Hugging Face dataset, 2025. URL https://huggingface.co/datasets/Yejy53/Nano-consistent-150k
2025
-
[53]
ReasonEdit: Towards reasoning-enhanced image editing models
Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, and Gang Yu. ReasonEdit: Towards reasoning-enhanced image editing models. arXiv preprint arXiv:2511.22625, 2025. URL https://arxiv.org/abs/2511.22625
-
[54]
Yuxin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Tong-Yee Lee, Oliver Deussen, and Changsheng Xu. ProSpect: Prompt spectrum for attribute-aware personalization of diffusion models. ACM Transactions on Graphics (SIGGRAPH Asia), 42(6):244:1–244:14, 2023. doi: 10.1145/3618342. URL https://doi.org/10.1145/3618342
-
[55]
UltraEdit: Instruction-based fine-grained image editing at scale
Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based fine-grained image editing at scale. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=9ZDdlgH6O8
2024
-
[56]
Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing
Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, and Haodong Duan. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025. URL https:...
2025
-
[57]
From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning
Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, and Hongsheng Li. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15329–15339, October 2025. U...
2025