arxiv: 2605.15181 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Anirudh Sundara Rajan , Krishna Kumar Singh , Yong Jae Lee

Authors on Pith no claims yet

Pith reviewed 2026-05-15 03:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords image editingmulti-step planningorchestrationvision-language modelsoutcome rewardslong-horizon tasksexperiential learning

0 comments

The pith

Coupling a planner with a reward-driven orchestrator enables reliable multi-step image editing from abstract instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that decomposes abstract editing instructions into atomic steps via a planner and then executes those steps through an orchestrator that selects tools and regions. A vision-language judge supplies outcome-based rewards that measure instruction adherence and visual quality, allowing the orchestrator to be trained directly on editing results. Successful trajectories are fed back to refine the planner, creating a closed loop between planning and execution. This matters because existing models and handcrafted pipelines often produce incoherent results on open-ended tasks such as making an advertisement vegetarian-friendly. The approach therefore aims to move beyond imitation learning or fixed pipelines toward outcome-driven improvement.

Core claim

The central claim is that an experiential loop in which a planner produces structured atomic decompositions and an orchestrator selects and applies tools and regions, guided by rewards from a vision-language judge, yields more coherent and reliable edits than single-step models or rule-based multi-step baselines. The orchestrator is trained to maximize the judge's rewards on instruction adherence and visual quality, while high-reward trajectories are used to update the planner. This tight coupling of planning with reward-driven execution is what distinguishes the method from prior agent-based approaches that rely on handcrafted pipelines or teacher imitation.

What carries the argument

The experiential framework of a planner that generates atomic decompositions coupled with an orchestrator trained to maximize vision-language judge rewards for tool and region selection.

If this is right

The method produces more coherent results on long-horizon, abstract instructions than single-step or rule-based approaches.
Learning occurs directly from editing outcomes rather than from imitation of expert trajectories.
Successful execution trajectories can be reused to iteratively improve the planner.
The orchestrator learns to choose appropriate tools and spatial regions conditioned on the current state and step.
The overall system handles open-ended tasks that require multiple coordinated changes without handcrafted decomposition rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward-driven loop could be tested on sequential tasks outside image editing, such as code editing or scene graph manipulation, if a suitable judge is available.
If the judge generalizes across domains, the framework might reduce reliance on large amounts of human demonstration data for agent training.
Extending the planner to output probabilistic decompositions rather than fixed sequences could improve robustness when early steps have multiple valid paths.

Load-bearing premise

A vision-language judge can reliably score both instruction adherence and visual quality across diverse editing tasks without systematic errors or biases.

What would settle it

Human ratings of the method's outputs showing lower coherence or lower instruction adherence than single-step baselines on the same set of abstract multi-step tasks.

Figures

Figures reproduced from arXiv: 2605.15181 by Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee.

**Figure 1.** Figure 1: Given a high-level instruction such as “Adapt for a rural audience,” single-step editors (e.g., Flux Kontext [19] and Qwen-Image-Edit [47]) struggle to jointly adapt visual themes, textual content, and audience-specific context while preserving the original advertisement layout and identity. In contrast, our framework decomposes the task into structured subtasks and orchestrates multiple tools using outc… view at source ↗

**Figure 2.** Figure 2: Checklist-guided planner (Stage 1). Given an input advertisement and a high-level instruction (left, e.g., “Target business travelers with added corporate benefits”), the planner generates a structured, ordered sequence of subtasks (center) that explicitly address checklist items (right). In this example, the generated plan includes adapting the room aesthetics for a professional audience, adding work-rel… view at source ↗

**Figure 3.** Figure 3: Reward-driven orchestration (Stage 2). Given an input image and a high-level instruction (e.g., “Target business travelers with added corporate benefits”), the orchestrator selects appropriate tools and spatial regions to execute a specific subtask (e.g., replacing the red pillows with neutral-toned professional decor pillows). The edited result is then evaluated by a VLM judge, which assesses instructio… view at source ↗

**Figure 4.** Figure 4: User study (randomized A/B testing). Results show a consistent human preference for our approach. User Studies To corroborate the MLLM judge results, we conduct a user study using randomized A/B testing. Participants are shown paired results in random order and asked to select their preferred edit or indicate a tie, while accounting for instruction following, identity preservation, and visual quality. … view at source ↗

**Figure 5.** Figure 5: Qualitative results on diverse long-horizon advertisement editing tasks. These examples show two challenging instructions—adapting for Business travelers, and for American Independence Day. Single-step editors (Flux Kontext [19] and Qwen-Image-Edit [47]) often perform partial stylistic changes or introduce minimal or shallow modifications in text, layout, or branding. In contrast, our method consistently… view at source ↗

**Figure 6.** Figure 6: Qualitative results on diverse long-horizon advertisement editing tasks. These examples show three challenging instructions—adapting for Lunar New Year, western audience, and for fitness-conscious audience. Our method consistently produces edits that are faithful to the instruction and globally coherent, jointly updating visual themes, textual content, layout elements, and brand messaging. Note that we b… view at source ↗

**Figure 7.** Figure 7: Planner–orchestrator subtasks and their outputs. Each row begins with the input advertisement (first column) and a high-level instruction. The subsequent columns show the sequence of subtask plans and edits produced by our system. These results illustrate how our checklist-guided planning decomposes an abstract instruction into concrete atomic edits and how our orchestrator selects tool-regions to execut… view at source ↗

**Figure 8.** Figure 8: Examples from the MagicBrush benchmark. Each example shows the [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

**Figure 9.** Figure 9: Region discovery using the SAM-2 + Qwen3-VL pipeline. SAM-2 first [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗

**Figure 10.** Figure 10: DeepSeek-OCR identifies textual and layout elements in the image by [PITH_FULL_IMAGE:figures/full_fig_p031_10.png] view at source ↗

**Figure 11.** Figure 11: Layer-based region discovery using Qwen-Layered. [PITH_FULL_IMAGE:figures/full_fig_p032_11.png] view at source ↗

**Figure 12.** Figure 12: Instruction-guided region discovery using Qwen-BBox. [PITH_FULL_IMAGE:figures/full_fig_p032_12.png] view at source ↗

read the original abstract

Modern image editing models produce realistic results but struggle with abstract, multi step instructions (e.g., ``make this advertisement more vegetarian-friendly''). Prior agent based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multistep baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a planner-orchestrator loop for multi-step image editing trained on VLM judge rewards, but the abstract gives no results or validation to show it works.

read the letter

The core idea is a planner that turns abstract instructions into atomic steps and an orchestrator that chooses tools and regions to carry them out. A vision-language judge scores the final image on instruction match and visual quality, the orchestrator trains to maximize those scores, and good trajectories are fed back to improve the planner. This is meant to replace handcrafted pipelines or simple imitation with something that learns directly from editing outcomes.

Referee Report

2 major / 2 minor

Summary. The paper proposes an experiential framework for open-ended image editing consisting of a planner that decomposes abstract multi-step instructions into atomic steps and an orchestrator that selects tools and regions for execution. A vision-language model judge supplies outcome-based scalar rewards for instruction adherence and visual quality; these rewards train the orchestrator via reinforcement, after which successful trajectories refine the planner. The central claim is that this tight coupling of planning and reward-driven execution produces more coherent and reliable results than single-step models or rule-based multi-step baselines.

Significance. If the empirical claims are substantiated, the work would be significant for agent-based image editing: it moves beyond handcrafted pipelines and imitation learning by grounding both planning and orchestration directly in editing outcomes via a learned reward signal. This could enable more flexible handling of abstract instructions while providing a reproducible training loop that prior methods lack.

major comments (2)

[Experiments] The abstract and method description assert that the approach 'yields more coherent and reliable edits than single-step or rule-based multistep baselines,' yet the manuscript contains no experimental results, quantitative metrics, ablation studies, or implementation details. This absence is load-bearing because the superiority claim cannot be evaluated without evidence that the reward-driven training actually improves coherence.
[Method] Section 3.3 (VLM judge): The central training procedure relies on the vision-language judge supplying accurate, consistent rewards for both instruction adherence and visual quality, but no validation is provided (e.g., human correlation, inter-rater agreement, or error analysis on edge cases). If judge scores systematically misalign with human preference, the learned policy optimizes a noisy objective, directly undermining the coherence advantage.

minor comments (2)

[Method] Clarify the exact interface between planner output and orchestrator input, including the format of atomic decompositions.
[Introduction] Add missing references to recent VLM-as-judge literature and prior agent-based editing systems for proper positioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We agree that the current manuscript lacks the necessary experimental evidence and judge validation to support the central claims, and we will revise accordingly to include these elements.

read point-by-point responses

Referee: [Experiments] The abstract and method description assert that the approach 'yields more coherent and reliable edits than single-step or rule-based multistep baselines,' yet the manuscript contains no experimental results, quantitative metrics, ablation studies, or implementation details. This absence is load-bearing because the superiority claim cannot be evaluated without evidence that the reward-driven training actually improves coherence.

Authors: We acknowledge that the submitted manuscript does not contain an Experiments section with quantitative results, metrics, ablations, or implementation details. This omission prevents direct evaluation of the superiority claims. In the revised version we will add a full Experiments section reporting comparisons to single-step models and rule-based multi-step baselines on instruction adherence, visual quality, and human preference metrics, together with ablations on the planner-orchestrator interaction and reward-driven training, plus all necessary implementation details. revision: yes
Referee: [Method] Section 3.3 (VLM judge): The central training procedure relies on the vision-language judge supplying accurate, consistent rewards for both instruction adherence and visual quality, but no validation is provided (e.g., human correlation, inter-rater agreement, or error analysis on edge cases). If judge scores systematically misalign with human preference, the learned policy optimizes a noisy objective, directly undermining the coherence advantage.

Authors: We agree that the absence of validation for the VLM judge is a critical gap. In the revised manuscript we will add a dedicated validation subsection that reports correlation between VLM judge scores and human ratings on a held-out set of edited images, inter-rater agreement statistics, and an error analysis covering edge cases such as ambiguous instructions and subtle visual modifications. This will demonstrate the reliability of the reward signal. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the experiential training framework

full rationale

The paper presents a training loop in which an external vision-language judge supplies outcome-based scalar rewards that are used to optimize the orchestrator and to filter trajectories for planner refinement. This structure treats the judge as an independent source of supervision rather than defining any quantity (such as reward or success) in terms of the model's own outputs. No equations or procedures are shown that reduce the claimed performance gain to a fitted parameter or a self-referential definition; the superiority over baselines is asserted as an empirical result of the reward-driven process. No self-citations function as load-bearing uniqueness theorems, and no ansatz or known empirical pattern is smuggled in via citation. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the approach appears to build on existing vision-language models and planning techniques without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5431 in / 1166 out tokens · 59674 ms · 2026-05-15T03:17:29.103673+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

experiential framework ... planner generates structured atomic decompositions and an orchestrator selects tools and regions ... vision-language judge provides outcome-based rewards
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

reward-driven policy jointly selects tools and regions based on judged executed edits

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 20 internal anchors

[1]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18392–18402 (2023)

work page 2023
[3]

In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 22560– 22570 (2023)

work page 2023
[4]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

arXiv preprint arXiv:2309.17102 (2023)

Fu, T.J., Hu, W., Du, X., Wang, W.Y., Yang, Y., Gan, Z.: Guiding instruction- based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102 (2023)

work page arXiv 2023
[6]

Google DeepMind: Gemini 2.0.https://gemini.google.com/(2025), accessed: 2026-03-12

work page 2025
[7]

Google DeepMind: Gemini 3 Pro (2026)

work page 2026
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

In: Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition

Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. In: Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition. pp. 14953–14962 (2023)

work page 2023
[10]

Prompt-to-Prompt Image Editing with Cross Attention Control

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Hertz, A., Voynov, A., Fruchter, S., Cohen-Or, D.: Style aligned image generation via shared attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4775–4785 (2024)

work page 2024
[12]

Iclr1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022)

work page 2022
[13]

arXiv preprint arXiv:2406.09403 (2024)

Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N.A., Kr- ishna, R.: Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403 (2024)

work page arXiv 2024
[14]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., Lin, S.: Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749 (2025) 18 A. S. Rajan et al

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2026)

Huang, Z., Ji, Y., Rajan, A.S., Cai, Z., Xiao, W., Wang, H., Hu, J., Lee, Y.J.: Vi- sualtoolagent (vista): A reinforcement learning framework for visual tool selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2026)

work page 2026
[16]

Reinforcement Learning via Self-Distillation

H¨ ubotter, J., L¨ ubeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Kleine Buening, T., Guestrin, C., Krause, A.: Reinforce- ment learning via self-distillation. arXiv preprint arXiv:2601.20802 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

In: ICCV (2025)

Ji, L., Qi, C., Chen, Q.: Instruction-based image editing with planning, reasoning, and generation. In: ICCV (2025)

work page 2025
[18]

In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Ku, M., Jiang, D., Wei, C., Yue, X., Chen, W.: Viescore: Towards explainable metrics for conditional image synthesis evaluation. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12268–12290 (2024)

work page 2024
[19]

Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., M¨ uller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025),https://a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22511–22521 (2023)

work page 2023
[21]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

In: CVPR (2024)

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)

work page 2024
[23]

Advances in neural information processing systems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)

work page 2023
[24]

Step1X-Edit: A Practical Framework for General Image Editing

Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Visual-RFT: Visual Reinforcement Fine-Tuning

Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual- rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

arXiv preprint arXiv:2509.23909 (2025)

Luo, X., Wang, J., Wu, C., Xiao, S., Jiang, X., Lian, D., Zhang, J., Liu, D., et al.: Editscore: Unlocking online rl for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909 (2025)

work page arXiv 2025
[27]

arXiv preprint arXiv:2508.06916 (2025)

Ma, S., Guo, Y., Su, J., Huang, Q., Zhou, Z., Wang, Y.: Talk2image: A multi-agent system for multi-turn image generation and editing. arXiv preprint arXiv:2508.06916 (2025)

work page arXiv 2025
[28]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[29]

arXiv preprint arXiv:2307.02421 (2023)

Mou, C., Wang, X., Song, J., Shan, Y., Zhang, J.: Dragondiffusion: Enabling drag- style manipulation on diffusion models. arXiv preprint arXiv:2307.02421 (2023)

work page arXiv 2023
[30]

arXiv preprint arXiv:2405.08246 (2024)

Nie, W., Liu, S., Mardani, M., Liu, C., Eckart, B., Vahdat, A.: Composi- tional text-to-image generation with dense blob representations. arXiv preprint arXiv:2405.08246 (2024)

work page arXiv 2024
[31]

OpenAI: Learning to reason with llms.https://openai.com/index/ learning-to-reason-with-llms/(2024), accessed: 2025-05-13

work page 2024
[32]

OpenAI: Introducing 4o image generation.https://openai.com/index/ introducing-4o-image-generation/(2025), accessed: 2026-03-12 Learning to Plan and Orchestrate for Open-Ended Image Editing 19

work page 2025
[33]

OpenAI: ChatGPT (2026)

work page 2026
[34]

In: ACM SIGGRAPH 2023 Conference Proceedings

Parmar, G., Kumar Singh, K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot image-to-image translation. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–11 (2023)

work page 2023
[35]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., M¨ uller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., R¨ adle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Doll´ ar, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024),https://arxiv.org/ abs/2408.00714

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

work page 2022
[38]

In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

Sagar, A., Srivastava, R., Venna, V.K., Sarvadevabhatla, R.K., et al.: Madverse: A hierarchical dataset of multi-lingual ads from diverse sources and categories. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 8087–8096 (2024)

work page 2024
[39]

Shenfeld, I., Damani, M., H¨ ubotter, J., Agrawal, P.: Self-distillation enables con- tinual learning (2026),https://arxiv.org/abs/2601.19897

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Shi, Y., Xue, C., Liew, J.H., Pan, J., Yan, H., Zhang, W., Tan, V.Y., Bai, S.: Dragdiffusion: Harnessing diffusion models for interactive point-based image edit- ing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8839–8849 (2024)

work page 2024
[41]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Sur´ ıs, D., Menon, S., Vondrick, C.: Vipergpt: Visual inference via python execu- tion for reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11888–11898 (2023)

work page 2023
[42]

In: NeurIPS (2025),https://arxiv.org/abs/2507.18624

Viswanathan, V., Sun, Y., Ma, S., Kong, X., Cao, M., Neubig, G., Wu, T.: Check- lists are better than reward models for aligning language models. In: NeurIPS (2025),https://arxiv.org/abs/2507.18624

work page arXiv 2025
[43]

In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition

Wang, X., Darrell, T., Rambhatla, S.S., Girdhar, R., Misra, I.: Instancediffusion: Instance-level control for image generation. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. pp. 6232–6242 (2024)

work page 2024
[44]

Advances in Neural Information Processing Systems 37, 128374–128395 (2024)

Wang, Z., Li, A., Li, Z., Liu, X.: Genartist: Multimodal llm as an agent for unified image generation and editing. Advances in Neural Information Processing Systems 37, 128374–128395 (2024)

work page 2024
[45]

DeepSeek-OCR: Contexts Optical Compression

Wei, H., Sun, Y., Li, Y.: Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Advances in neural information processing systems35, 24824–24837 (2022)

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022)

work page 2022
[47]

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 20 A. S. Rajan et al

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

arXiv preprint arXiv:2509.26346 (2025)

Wu, K., Jiang, S., Ku, M., Nie, P., Liu, M., Chen, W.: Editreward: A human-aligned reward model for instruction-guided image editing. arXiv preprint arXiv:2509.26346 (2025)

work page arXiv 2025
[50]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)

work page 2025
[51]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Yang, J., Zhang, H., Li, F., Zou, X., Li, C., Gao, J.: Set-of-mark prompting un- leashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

In: ICML (2024)

Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Cui, B.: Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. In: ICML (2024)

work page 2024
[53]

arXiv preprint arXiv:2507.05259 (2025)

Yeh, C.H., Wang, Y., Zhao, N., Zhang, R., Li, Y., Ma, Y., Singh, K.K.: Beyond simple edits: X-planner for complex instruction-based image editing. arXiv preprint arXiv:2507.05259 (2025)

work page arXiv 2025
[54]

Yin, S., Zhang, Z., Tang, Z., Gao, K., Xu, X., Yan, K., Li, J., Chen, Y., Chen, Y., Shum, H.Y., Ni, L.M., Zhou, J., Lin, J., Wu, C.: Qwen-image-layered: Towards in- herent editability via layer decomposition (2025),https://arxiv.org/abs/2512. 15603

work page 2025
[55]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025)

work page 2025
[56]

arXiv preprint arXiv:2511.21087 (2025)

Zeng, Z., Hua, H., Luo, J.: Mira: Multimodal iterative reasoning agent for image editing. arXiv preprint arXiv:2511.21087 (2025)

work page arXiv 2025
[57]

Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023)

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Pro- cessing Systems36, 31428–31449 (2023)

work page 2023
[58]

Advances in Neural Information Pro- cessing Systems36(2024)

Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Pro- cessing Systems36(2024)

work page 2024
[59]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Zhang, S., Yang, X., Feng, Y., Qin, C., Chen, C.C., Yu, N., Chen, Z., Wang, H., Savarese, S., Ermon, S., et al.: Hive: Harnessing human feedback for instructional visual editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9026–9036 (2024)

work page 2024
[60]

arXiv preprint arXiv:2504.00010 (2025)

Zhang, Y., Li, J., Tai, Y.W.: Layercraft: Enhancing text-to-image generation with cot reasoning and layered object integration. arXiv preprint arXiv:2504.00010 (2025)

work page arXiv 2025
[61]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F., Grover, A.: Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[62]

NeuRIPS (2024)

Zhenyu, W., Aoxue, L., Zhenguo, L., Xihui, L.: Genartist: Multimodal llm as an agent for unified image generation and editing. NeuRIPS (2024)

work page 2024
[63]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision- language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023) Learning to Plan and Orchestrate for Open-Ended Image Editing 21 Appendix We provide additional qualitative results, experimental comparisons, and imple- mentation details below...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

Replace background with a festive Diwali scene

work page
[65]

Preserve: product pack design

Add fireworks Output: [ "Preserve: product pack design", "Preserve: brand logo", "Preserve: wooden surface", "Replace: background -> Diwali festive scene", "Add: fireworks" ] Learning to Plan and Orchestrate for Open-Ended Image Editing 41 This prompt helps us to generate a dense checklist. Now we use this dense checklist to score the final edit. The syst...

work page