pith. machine review for the scientific record.

arxiv: 2604.25477 · v1 · submitted 2026-04-28 · 💻 cs.CV · cs.AI

Recognition: unknown

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

Bo Zheng, Cheng Yu, Hanqing Yang, Jun Song, Qiang Zhou, Sashuai Zhou, Tiezheng Ge, Yongchao Du, Zhibin Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords reasoning-driven image editing · decoupled optimization · dual-atomic reinforcement learning · planning module · checklist rewards · generative models

The pith

Decoupling the reasoning planner from the image generator and training it with separate cognitive and visual rewards improves complex image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that reasoning failures in image editing can be addressed by isolating the planning module, called the Thinker, and optimizing it independently while holding the generative Editor fixed. It does this through a dual-atomic reinforcement learning setup that supplies one reward for the quality of the executable plan and another for the quality of the final edited image, both measured by checklists built from a rational reference description of the ideal result. A two-stage data pipeline first creates diverse reasoning-focused examples and then refines them by difficulty to support the training. If the claim holds, targeted updates to planning logic can raise performance on hard edits without retraining the entire generative model.
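
To make the decoupling concrete, the loop this paragraph describes can be sketched in a few lines of Python. Everything named below (the generate_plan and render calls, the make_checklists and score_checklist helpers, and the equal weighting of the two rewards) is a hypothetical illustration, not the paper's actual interface.

    from dataclasses import dataclass

    @dataclass
    class Rollout:
        plan: str        # the Thinker's executable plan
        image: object    # output of the frozen Editor
        reward: float    # scalar fed back to the Thinker's RL update only

    def rollout_and_reward(thinker, frozen_editor, make_checklists, score_checklist,
                           source_image, instruction, reference_description,
                           w_cog=0.5, w_vis=0.5):
        # 1. The trainable Thinker turns the instruction into an executable plan.
        plan = thinker.generate_plan(source_image, instruction)
        # 2. The frozen Editor renders that plan; its weights receive no gradient.
        edited_image = frozen_editor.render(source_image, plan)
        # 3. Checklists are built from the source image, the instruction, and the
        #    rational reference description of the ideal post-edit scene.
        plan_checklist, image_checklist = make_checklists(
            source_image, instruction, reference_description)
        # 4. Cognitive-atomic reward: does the plan satisfy its checklist?
        r_cog = score_checklist(plan, plan_checklist)
        # 5. Visual-atomic reward: does the final image satisfy its checklist?
        r_vis = score_checklist(edited_image, image_checklist)
        # 6. Only the Thinker is updated with the combined scalar.
        return Rollout(plan, edited_image, w_cog * r_cog + w_vis * r_vis)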

Core claim

By keeping the generative Editor fixed and training only the Thinker with dual-atomic reinforcement learning, the method decomposes feedback into a cognitive-atomic reward that judges plan quality and a visual-atomic reward that judges image quality, both delivered through verifiable checklists synthesized from the source image, user instruction, and a rational reference description of the desired post-edit scene.
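
Read literally, each verifiable checklist reward can be pictured as the fraction of yes/no checklist items a judge model confirms against the candidate plan or edited image. The judge.ask_yes_no interface below is an assumed stand-in (for example, a VLM prompted once per item); the paper's actual verification mechanism may differ.

    def checklist_reward(candidate, checklist_items, judge):
        """candidate: a plan string or an edited image; checklist_items: yes/no questions.
        judge.ask_yes_no is a hypothetical verifier call, e.g. a per-item VLM prompt."""
        passed = sum(int(judge.ask_yes_no(question=item, evidence=candidate))
                     for item in checklist_items)
        return passed / max(len(checklist_items), 1)   # atomic reward in [0, 1]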

What carries the argument

The dual-atomic reinforcement learning framework that splits rewards into a cognitive-atomic component for assessing the Thinker's executable plan and a visual-atomic component for assessing final image quality, both implemented via checklists.

If this is right

  • Substantial gains appear on reasoning-driven benchmarks including RISE-Bench and KRIS-Bench.
  • Community models reach performance levels competitive with strong proprietary models.
  • The contribution of the planning module can be measured in isolation under a fixed Editor.
  • A two-stage curation pipeline produces a difficulty-aware curriculum that supports effective reinforcement learning.
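
One plausible reading of the difficulty-aware refinement step in the last bullet is a pass-rate filter: probe each candidate example a few times with the current Thinker plus frozen Editor and keep only those in an intermediate difficulty band. The solve() probe, trial count, and band below are illustrative assumptions, not the paper's recipe.

    def refine_by_difficulty(examples, solve, n_trials=8, keep_band=(0.2, 0.8)):
        """Keep examples whose empirical pass rate is informative for RL.
        solve(example) -> bool is a hypothetical probe: one attempted edit,
        judged against that example's checklist."""
        curated = []
        for example in examples:
            pass_rate = sum(bool(solve(example)) for _ in range(n_trials)) / n_trials
            if keep_band[0] <= pass_rate <= keep_band[1]:   # not trivial, not hopeless
                curated.append(example)
        return curated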

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same decoupled pattern could be tested on video generation or 3D scene editing where planning and rendering are already separate.
  • Only the planner would need retraining when new reasoning patterns emerge, leaving the visual generator untouched.
  • Checklist-based atomic rewards might transfer to other multimodal tasks where direct human feedback is expensive.

Load-bearing premise

The synthesized rational reference descriptions and resulting checklists supply unbiased and reliable signals for both cognitive plan quality and visual image quality.

What would settle it

Performance gains disappear on a new set of editing tasks when the rational reference descriptions are removed from the checklist synthesis process.

Figures

Figures reproduced from arXiv: 2604.25477 by Bo Zheng, Cheng Yu, Hanqing Yang, Jun Song, Qiang Zhou, Sashuai Zhou, Tiezheng Ge, Yongchao Du, Zhibin Wang.

Figure 1: Comparison of Training Paradigms. (a) Prior paradigms jointly or alternately train both Thinker and Editor. In this intertwined optimization, feedback from a single outcome is used to update both modules, making it harder to isolate whether errors arise from planning or execution. Additionally, visual-only rewards may provide limited guidance on the quality of the executable plan. (b) Our DDA-Thinker decou…

Figure 2: Overview of the Reasoning-Enhanced Data Curation Process. This process constructs two datasets: an initial set for supervised fine-tuning (DSFT) and a refined set for reinforcement learning (DRFT). Stage 1 (Generative Data Curation Pipeline): Guided by a reasoning taxonomy, an LLM serves as the scenario generator to create diverse data triplets. A T2I model then synthesizes the source images, and a VLM-bas…

Figure 3: The DDA-Thinker Training Framework. Our framework trains a Thinker (πθ) over a frozen Editor (E). The core of our method is the dual-atomic reward mechanism, which provides fine-grained feedback on both the executable plan and the final image. The cognitive-atomic reward assesses the plan’s quality (e.g., IP, LS, PE), while the visual-atomic reward assesses the final image’s quality (e.g., IF, AC, HD). The…

Figure 4: Qualitative Visualization of Our Reward Components. The top row (‘hydraulic press’) shows the visual reward correcting the SFT model’s hallucinated press to improve scene consistency. The bottom row (‘baseball’) highlights the role of the cognitive reward in supporting physically grounded reasoning to render a faithful impact shatter.

Figure 5: Visual Comparison of Results Without and With DDA-Thinker. (a) Success demonstration of reasoning-driven image editing. (b) Failure diagnosis of system boundaries. Note: <think> blocks are condensed for readability; <answer> blocks are provided in full.
Original abstract

Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforcement learning framework. This framework decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward to directly assess the quality of the Thinker's executable plan, which serves as the actionable outcome of the Thinker's reasoning, and a visual-atomic reward to assess the final image quality. To improve checklist quality, our checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse and reasoning-focused dataset, then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance. Our method enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.
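
Schematically, the decoupled RL the abstract describes can be pictured as a group-relative policy update applied to the Thinker alone, with the frozen Editor entering only through the reward. The clipped loss below is a generic GRPO-style sketch with illustrative constants; it is not the paper's exact objective.

    import torch

    def thinker_policy_loss(logprobs_new, logprobs_old, group_rewards, clip_eps=0.2):
        """Clipped policy-gradient loss over a group of rollouts sharing one editing
        prompt; only the Thinker's log-probabilities carry gradients. All three
        arguments are 1-D tensors of length group_size."""
        rewards = torch.as_tensor(group_rewards, dtype=logprobs_new.dtype)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative
        ratio = torch.exp(logprobs_new - logprobs_old.detach())           # importance ratio
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        return -torch.min(ratio * advantages, clipped * advantages).mean()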

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DDA-Thinker, a Thinker-centric framework that decouples optimization of a reasoning/planning module (Thinker) from a fixed generative Editor using dual-atomic reinforcement learning. Feedback is decomposed into a cognitive-atomic reward (assessing plan quality via verifiable checklists) and a visual-atomic reward (assessing final image quality), with checklists synthesized from LLM-generated rational reference descriptions of ideal post-edit scenes. A two-stage data curation pipeline (diverse synthesis followed by difficulty-aware refinement) supports RL training. Experiments on RISE-Bench and KRIS-Bench are claimed to show substantial overall performance gains, enabling community models to compete with proprietary ones.

Significance. If the central results hold after validation of the reward signals, the work would demonstrate that targeted, decoupled optimization of reasoning components can meaningfully advance complex image editing without retraining the base generative model. This offers a practical route to interpretability and incremental gains in reasoning-driven tasks, with potential to narrow the gap between open-source and closed models.

major comments (3)
  1. [Method (checklist synthesis and dual-atomic rewards)] The central claim of substantial gains and competitiveness with proprietary models rests on the cognitive-atomic and visual-atomic rewards being faithful proxies for reasoning quality and edit success. However, these rewards derive from checklists synthesized from LLM-generated rational references (via two-stage curation and difficulty-aware refinement); no human validation, inter-annotator agreement, or correlation analysis with independent metrics (e.g., human-rated plan executability or edit fidelity) is provided to rule out synthesis biases or reward hacking. This is load-bearing for the decoupled RL contribution.
  2. [Experiments] The abstract and claims assert 'substantially improves overall performance' and 'competitive with strong proprietary models' on RISE-Bench and KRIS-Bench, yet no quantitative metrics, baseline tables, statistical significance tests, or ablation results (e.g., Thinker vs. joint optimization, checklist variants) are referenced in the summary material. Without these, the magnitude, reliability, and attribution of gains to the dual-atomic RL cannot be evaluated.
  3. [Framework overview and Editor-fixed setting] By holding the Editor fixed while optimizing only the Thinker, any mismatch between checklist-rewarded plans and what the Editor can actually render is unaddressed. No analysis of failure cases where high checklist scores yield poor visual outcomes (or vice versa) is given, weakening the claim that the framework isolates and improves reasoning.
minor comments (2)
  1. [Method] Notation for the two atomic rewards and their combination into the RL objective could be clarified with explicit equations or pseudocode to aid reproducibility.
  2. [Discussion or Conclusion] The paper should include a limitations section discussing potential biases in LLM-synthesized references and the generalizability of the approach beyond the chosen benchmarks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below with clarifications from the full paper and commit to revisions that strengthen the presentation and validation of our contributions.

Point-by-point responses
  1. Referee: [Method (checklist synthesis and dual-atomic rewards)] The central claim of substantial gains and competitiveness with proprietary models rests on the cognitive-atomic and visual-atomic rewards being faithful proxies for reasoning quality and edit success. However, these rewards derive from checklists synthesized from LLM-generated rational references (via two-stage curation and difficulty-aware refinement); no human validation, inter-annotator agreement, or correlation analysis with independent metrics (e.g., human-rated plan executability or edit fidelity) is provided to rule out synthesis biases or reward hacking. This is load-bearing for the decoupled RL contribution.

    Authors: We agree that validating the faithfulness of the synthesized checklists and dual-atomic rewards is critical to support the core claims. The manuscript (Sections 3.1–3.2) grounds checklist synthesis in LLM-generated rational reference descriptions of ideal post-edit scenes, combined with the two-stage curation pipeline for diversity and difficulty-aware refinement. However, we acknowledge the absence of human validation, inter-annotator agreement metrics, or explicit correlation analysis with human-rated plan executability and edit fidelity. In the revised manuscript, we will add a dedicated human evaluation study (including agreement scores and Pearson/Spearman correlations with independent human judgments) as a new subsection or appendix to directly address potential synthesis biases and reward hacking concerns (a minimal sketch of such an agreement check appears after these responses). revision: yes

  2. Referee: [Experiments] The abstract and claims assert 'substantially improves overall performance' and 'competitive with strong proprietary models' on RISE-Bench and KRIS-Bench, yet no quantitative metrics, baseline tables, statistical significance tests, or ablation results (e.g., Thinker vs. joint optimization, checklist variants) are referenced in the summary material. Without these, the magnitude, reliability, and attribution of gains to the dual-atomic RL cannot be evaluated.

    Authors: The full manuscript contains a detailed Experiments section (Section 4) reporting quantitative metrics on RISE-Bench and KRIS-Bench, baseline comparisons, ablation studies (including Thinker-only vs. joint optimization and checklist variants), and performance breakdowns that support the claims of substantial gains and competitiveness with proprietary models. We will revise the abstract and introduction to explicitly cross-reference these tables, figures, and statistical details so that the magnitude, reliability, and attribution of improvements to the dual-atomic RL framework are immediately clear without requiring readers to locate them in the body. revision: partial

  3. Referee: [Framework overview and Editor-fixed setting] By holding the Editor fixed while optimizing only the Thinker, any mismatch between checklist-rewarded plans and what the Editor can actually render is unaddressed. No analysis of failure cases where high checklist scores yield poor visual outcomes (or vice versa) is given, weakening the claim that the framework isolates and improves reasoning.

    Authors: The visual-atomic reward explicitly scores the final image quality after the fixed Editor renders the Thinker’s plan, so any mismatch between plan quality and renderability is penalized during RL optimization. This design choice directly ties the decoupled training to observable visual success. We acknowledge that the current version lacks a dedicated failure-case analysis. In the revision, we will add an analysis subsection (with qualitative examples) examining cases of high cognitive-atomic scores paired with low visual-atomic outcomes (and the reverse) to better illustrate how the framework isolates reasoning improvements while accounting for Editor limitations. revision: partial
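
The first response commits to Pearson/Spearman correlations between checklist-derived rewards and independent human judgments. A minimal sketch of that agreement check, with placeholder per-sample score arrays, could look like this.

    from scipy.stats import pearsonr, spearmanr

    def reward_human_agreement(checklist_rewards, human_ratings):
        """Both inputs are per-sample scores aligned by index (one edited image each)."""
        pearson_r, pearson_p = pearsonr(checklist_rewards, human_ratings)
        spearman_r, spearman_p = spearmanr(checklist_rewards, human_ratings)
        return {"pearson": (pearson_r, pearson_p), "spearman": (spearman_r, spearman_p)}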

Circularity Check

0 steps flagged

No circularity: external benchmarks and an independent synthesis pipeline keep the claims from resting on their own training signal.

Full rationale

The paper's core contribution is a decoupled RL training procedure for a Thinker module, with rewards obtained from checklists synthesized via a two-stage curation process grounded in source images, instructions, and LLM-generated rational references. Performance is measured on separate external benchmarks (RISE-Bench, KRIS-Bench) rather than on the training checklists themselves. No equations, fitted parameters, or self-citations are presented that would make the reported gains equivalent to the training inputs by construction. The synthesis step affects only the reward signal used during optimization; it does not redefine or tautologically guarantee the benchmark outcomes. This satisfies the default expectation of a non-circular empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The abstract introduces framework components without mathematical definitions or external grounding; no free parameters or axioms are specified, and none of the named entities carries independent evidence.

invented entities (3)
  • Thinker module no independent evidence
    purpose: Independent reasoning and planning module for image edits
    Core new component whose optimization is the paper's focus
  • cognitive-atomic reward no independent evidence
    purpose: Direct assessment of executable plan quality via checklists
    One of the two atomic feedback signals
  • visual-atomic reward no independent evidence
    purpose: Assessment of final image quality via checklists
    Second atomic feedback signal

pith-pipeline@v0.9.0 · 5598 in / 1321 out tokens · 54933 ms · 2026-05-07T16:39:44.022189+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 40 canonical work pages · 8 internal anchors

  1. [1]

    Qwen-image technical report,

    C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y . Chen, Y . Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y . Wang, Y . Zhang, Y . Zhu, Y . Wu, Y . Cai, and Z. Liu, “Qwen-image technica...

  2. [2]

    Flux.1: Text-to-image synthesis via flow matching,

    B. F. Labs, “Flux.1: Text-to-image synthesis via flow matching,”Technical Announcement, 2024

  3. [3]

    Longcat-image technical report

    M. L. Team, H. Ma, H. Tan, J. Huang, J. Wu, J.-Y . He, L. Gao, S. Xiao, X. Wei, X. Maet al., “Longcat-image technical report,”arXiv preprint arXiv:2512.07584, 2025

  4. [4]

    Diffusion Models Beat GANs on Image Synthesis

    P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,”arXiv preprint arXiv:2105.05233, 2021

  5. [5]

    Diffusion model alignment using direct preference optimization,

    B. Wallace, A. Gokul, and N. Naik, “Diffusion model alignment using direct preference optimization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  6. [6]

    Diffusion model-based image editing: A survey,

    Y . Huang, J. Huang, Y . Liu, M. Yan, J. Lv, J. Liu, W. Xiong, H. Zhang, L. Cao, and S. Chen, “Diffusion model-based image editing: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 6, pp. 4409–4437,

  7. [7]

    Available: https://doi.org/10.1109/TPAMI.2025.3541625

    [Online]. Available: https://doi.org/10.1109/TPAMI.2025.3541625

  8. [8]

    Freeedit: Mask-free reference-based image editing with multi-modal instruction,

    R. He, K. Ma, L. Huang, S. Huang, J. Gao, X. Wei, J. Dai, J. Han, and S. Liu, “Freeedit: Mask-free reference-based image editing with multi-modal instruction,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 3, pp. 3319–3334, 2026. [Online]. Available: https://doi.org/10.1109/TPAMI.2025.3636582

  9. [9]

    Instructpix2pix: Learning to follow image editing instructions,

    T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 392– 18 402

  10. [10]

    Opencir: Conditional image repainting with open condition mixture,

    S. Weng, X. Gong, H. Zheng, X. Wang, S. Li, and B. Shi, “Opencir: Conditional image repainting with open condition mixture,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 11, pp. 10 406–10 419, 2025. [Online]. Available: https://doi.org/10.1109/TPAMI.2025.3597936

  11. [11]

    Step1X-Edit: A Practical Framework for General Image Editing

    S. Liu, Y . Han, P. Xing, F. Yin, R. Wang, W. Cheng, J. Liao, Y . Wang, H. Fu, C. Han, G. Li, Y . Peng, Q. Sun, J. Wu, Y . Cai, Z. Ge, R. Ming, L. Xia, X. Zeng, Y . Zhu, B. Jiao, X. Zhang, G. Yu, and D. Jiang, “Step1x- edit: A practical framework for general image editing,”arXiv preprint arXiv:2504.17761, 2025

  12. [12]

    Imagegen- cot: Enhancing text-to-image in-context learning with chain-of-thought reasoning,

    J. Liao, Z. Yang, L. Li, D. Li, K. Lin, Y . Cheng, and L. Wang, “Imagegen- cot: Enhancing text-to-image in-context learning with chain-of-thought reasoning,”arXiv preprint arXiv:2503.19312, 2025

  13. [13]

    T2i- compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation,

    K. Huang, C. Duan, K. Sun, E. Xie, Z. Li, and X. Liu, “T2i- compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 5, pp. 3563–3579, 2025. [Online]. Available: https://doi.org/10.1109/TPAMI.2025.3531907

  14. [14]

    Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models.arXiv preprint arXiv:2305.13655,

    L. Lian, B. Li, A. Yala, and T. Darrell, “Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models,” 2024. [Online]. Available: https: //arxiv.org/abs/2305.13655

  15. [15]

    Editthinker: Unlocking iterative reasoning for any image editor,

    H. Li, M. Zhang, D. Zheng, Z. Guo, Y . Jia, K. Feng, H. Yu, Y . Liu, Y . Feng, P. Pei, X. Cai, L. Huang, H. Li, and S. Liu, “Editthinker: Unlocking iterative reasoning for any image editor,” 2025

  16. [16]

    Spatialreward: Verifiable spatial re- ward modeling for fine-grained spatial consistency in text-to- image generation.arXiv preprint arXiv:2603.22228, 2026

    S. Zhou, Q. Zhou, J. Ma, Y . Cao, R. Hu, Z. Zhang, X. Yang, Z. Wang, J. Song, C. Yu, B. Zheng, and Z. Zhao, “Spatialreward: Verifiable spatial reward modeling for fine-grained spatial consistency in text-to-image generation,” 2026. [Online]. Available: https://arxiv.org/abs/2603.22228

  17. [17]

    Evaluating compositional text-to-image generation: A benchmark and analysis,

    Z. Wanget al., “Evaluating compositional text-to-image generation: A benchmark and analysis,” inarXiv preprint arXiv:2401.0xxx, 2024

  18. [18]

    Envisioning beyond the pixels: Bench- marking reasoning-informed visual editing.arXiv preprint arXiv:2504.02826, 2025

    X. Zhao, P. Zhang, K. Tang, H. Li, Z. Zhang, G. Zhai, J. Yan, H. Yang, X. Yang, and H. Duan, “Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing,”arXiv preprint arXiv:2504.02826, 2025

  19. [19]

    Unified Thinker: A General Reasoning Modular Core for Image Generation

    S. Zhou, Q. Zhou, J. Hu, H. Yang, Y . Cao, J. Ma, Y . Ma, J. Song, T. Ge, C. Yuet al., “Unified thinker: A general reasoning modular core for image generation,”arXiv preprint arXiv:2601.03127, 2026

  20. [20]

    ThinkRL-Edit: Thinking in reinforcement learning for reasoning-centric image editing.arXiv preprint arXiv:2601.03467, 2026

    H. Li, L. Jiang, Q. Yan, Y . Song, H. Kang, Z. Liu, X. Lu, B. Wu, and D. Cai, “Thinkrl-edit: Thinking in reinforcement learning for reasoning- centric image editing,”arXiv preprint arXiv:2601.03467, 2026

  21. [21]

    Thinkgen: Generalized thinking for visual generation,

    S. Jiao, Y . Lin, Y . Zhong, Q. She, W. Zhou, X. Lan, Z. Huang, F. Yu, Y . Yu, Y . Zhaoet al., “Thinkgen: Generalized thinking for visual generation,” arXiv preprint arXiv:2512.23568, 2025

  22. [22]

    Reasonedit: Towards reasoning-enhanced image editing models,

    F. Yin, S. Liu, Y . Han, Z. Wang, P. Xing, R. Wang, W. Cheng, Y . Wang, A. Li, Z. Yin, P. Chen, X. Zhang, D. Jiang, X. Zeng, and G. Yu, “Reasonedit: Towards reasoning-enhanced image editing models,” 2025

  23. [23]

    Flow-GRPO: Training Flow Matching Models via Online RL

    J. Liu, G. Liu, J. Liang, Y . Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-grpo: Training flow matching models via online rl,” arXiv preprint arXiv:2505.05470, 2025

  24. [24]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Z. Xue, J. Wu, Y . Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huanget al., “Dancegrpo: Unleashing grpo on visual generation,” arXiv preprint arXiv:2505.07818, 2025

  25. [25]

    Text-guided human image manipulation via image-text shared space,

    X. Xu, Y . Chen, X. Tao, and J. Jia, “Text-guided human image manipulation via image-text shared space,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6486–6500, 2022. [Online]. Available: https://doi.org/10.1109/TPAMI.2021.3085339

  26. [26]

    Consistent image layout editing with diffusion models,

    T. Xia, Y . Zhang, T. Liu, and L. Zhang, “Consistent image layout editing with diffusion models,”IEEE Transactions on Image Processing, vol. 34, pp. 6978–6992, 2025

  27. [27]

    Instruction-driven multi-weather image translation based on a large-scale image editing model,

    Y . Feng, J. Li, and M. Zhou, “Instruction-driven multi-weather image translation based on a large-scale image editing model,”IEEE Transac- tions on Image Processing, vol. 34, pp. 7462–7472, 2025

  28. [28]

    Spherical patch generative adversarial net for unconditional panoramic image generation,

    M. Xu, X. Sun, S. Li, L. Jiang, J. Xia, and X. Deng, “Spherical patch generative adversarial net for unconditional panoramic image generation,” IEEE Transactions on Image Processing, vol. 34, pp. 3833–3848, 2025

  29. [29]

    Talk-to-edit: Fine-grained 2d and 3d facial editing via dialog,

    Y . Jiang, Z. Huang, T. Wu, X. Pan, C. C. Loy, and Z. Liu, “Talk-to-edit: Fine-grained 2d and 3d facial editing via dialog,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 5, pp. 3692–3706, 2024. [Online]. Available: https://doi.org/10.1109/TPAMI.2023.3347299

  30. [30]

    Pixel-inconsistency modeling for image manipulation localization,

    C. Kong, A. Luo, S. Wang, H. Li, A. Rocha, and A. C. Kot, “Pixel-inconsistency modeling for image manipulation localization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 6, pp. 4455–4472,

  31. [31]

    Available: https://doi.org/10.1109/TPAMI.2025.3541028

    [Online]. Available: https://doi.org/10.1109/TPAMI.2025.3541028

  32. [32]

    Deepgen 1.0: A lightweight unified multimodal model for advancing image generation and editing.arXiv preprint arXiv:2602.12205, 2026

    D. Wang, R. Li, F. Han, C. Ma, W. Song, S. Wang, Y . Wang, Y . Xin, H. Liu, Z. Zhanget al., “Deepgen 1.0: A lightweight unified multimodal model for advancing image generation and editing,”arXiv preprint arXiv:2602.12205, 2026. 13

  33. [33]

    Unit: Unified multimodal chain-of-thought test-time scaling,

    L. L. Chen, H. Ma, Z. Fan, Z. Huang, A. Sinha, X. Dai, J. Wang, Z. He, J. Yang, C. Liet al., “Unit: Unified multimodal chain-of-thought test-time scaling,”arXiv preprint arXiv:2602.12279, 2026

  34. [34]

    Endocot: Scaling endogenous chain-of-thought reasoning in diffusion models,

    X. Dai, Y . Zhou, L. Xing, J. Bu, X. Wei, Y . Liu, B. Zhang, K. Chen, and Y . Zang, “Endocot: Scaling endogenous chain-of-thought reasoning in diffusion models,”arXiv preprint arXiv:2603.12252, 2026

  35. [35]

    Promptrl: Prompt matters in rl for flow-based image generation,

    F.-Y . Wang, H. Zhang, M. Gharbi, H. Li, and T. Park, “Promptrl: Prompt matters in rl for flow-based image generation,”arXiv preprint arXiv:2602.01382, 2026

  36. [36]

    Consensus- agent deep reinforcement learning for face aging,

    L. Lin, H. Liu, J. Liang, Z. Li, J. Feng, and H. Han, “Consensus- agent deep reinforcement learning for face aging,”IEEE Trans. Image Process., vol. 33, pp. 1795–1809, 2024. [Online]. Available: https://doi.org/10.1109/TIP.2024.3364074

  37. [37]

    Replan: Reasoning-guided region planning for complex instruction-based image editing,

    T. Qu, L. Ke, X. Zhan, L. Tang, Y . Liu, B. Peng, B. Yu, D. Yu, and J. Jia, “Replan: Reasoning-guided region planning for complex instruction-based image editing,”arXiv preprint arXiv:2512.16864, 2025

  38. [38]

    Unireason 1.0: A unified reasoning framework for world knowledge aligned image generation and editing.arXiv preprint arXiv:2602.02437, 2026

    D. Wang, C. Ma, F. Han, S. Wu, W. Song, Y . Wang, Z. Zhang, T. Wang, S. Wang, Z. Weiet al., “Unireason 1.0: A unified reasoning framework for world knowledge aligned image generation and editing,”arXiv preprint arXiv:2602.02437, 2026

  39. [39]

    Think-then-generate: Reasoning-aware text-to-image diffusion with llm encoders,

    S. Kou, J. Jin, Z. Zhou, Y . Ma, Y . Wang, Q. Chen, P. Jiang, X. Yang, J. Zhu, K. Yuet al., “Think-then-generate: Reasoning-aware text-to-image diffusion with llm encoders,”arXiv preprint arXiv:2601.10332, 2026

  40. [40]

    Edit-r1: Unleashing reasoning-based reinforcement learning for image editing

    H. Guo, J. Wu, J. Liu, Y . Gao, Z. Ye, L. Yuan, X. Wang, Y . Yu, and W. Huang, “Edit-r1: Unleashing reasoning-based reinforcement learning for image editing.”

  41. [41]

    Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265,

    Y . Niu, M. Ning, M. Zheng, W. Jin, B. Lin, P. Jin, J. Liao, K. Ning, C. Feng, B. Zhu, and L. Yuan, “Wise: A world knowledge-informed semantic evaluation for text-to-image generation,”arXiv preprint arXiv:2503.07265, 2025

  42. [42]

    Spatialclip: Learning 3d-aware image representations from spatially discriminative language,

    Z. Wang, S. Zhou, S. He, H. Huang, L. Yang, Z. Zhang, X. Cheng, S. Ji, T. Jin, H. Zhao, and Z. Zhao, “Spatialclip: Learning 3d-aware image representations from spatially discriminative language,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 29 656–29 666

  43. [43]

    Onereward: Unified mask-guided image generation via multi-task human preference learning.arXiv preprint arXiv:2508.21066, 2025

    Y . Gong, X. Wang, J. Wu, S. Wang, Y . Wang, and X. Wu, “Onereward: Unified mask-guided image generation via multi-task human preference learning,”arXiv preprint arXiv:2508.21066, 2025

  44. [44]

    Describe, spot and explain: Interpretable representation learning for discriminative visual reasoning,

    C. Lin and Y . F. Wang, “Describe, spot and explain: Interpretable representation learning for discriminative visual reasoning,”IEEE Trans. Image Process., vol. 32, pp. 2481–2492, 2023. [Online]. Available: https://doi.org/10.1109/TIP.2023.3268001

  45. [45]

    Editscore: Unlocking online rl for image editing via high-fidelity reward modeling.arXiv preprint arXiv:2509.23909, 2025

    X. Luo, J. Wang, C. Wu, S. Xiao, X. Jiang, D. Lian, J. Zhang, D. Liu et al., “Editscore: Unlocking online rl for image editing via high-fidelity reward modeling,”arXiv preprint arXiv:2509.23909, 2025

  46. [46]

    Editre- ward: A human-aligned reward model for instruction-guided image editing.arXiv preprint arXiv:2509.26346, 2025

    K. Wu, S. Jiang, M. Ku, P. Nie, M. Liu, and W. Chen, “Editreward: A human-aligned reward model for instruction-guided image editing,” arXiv preprint arXiv:2509.26346, 2025

  47. [47]

    Decompose and compare consistency: Measuring vlms’ answer reliability via task-decomposition consistency comparison,

    Q. Yang, W. Yan, and A. Agrawal, “Decompose and compare consistency: Measuring vlms’ answer reliability via task-decomposition consistency comparison,” 2024

  48. [48]

    A minimalist approach to llm reasoning: from rejection sampling to reinforce,

    W. Xiong, J. Yao, Y . Xu, B. Pang, L. Wang, D. Sahoo, J. Li, N. Jiang, T. Zhang, C. Xionget al., “A minimalist approach to llm reasoning: from rejection sampling to reinforce,”arXiv, 2025

  49. [49]

    Star-ds: Step-level uncertainty-aware reasoning data selection in reinforcement learning for llm multi-step reasoning

    S. Wu, D. Li, W. Feng, H. Ye, J. Lou, and S.-K. Ng, “Star-ds: Step-level uncertainty-aware reasoning data selection in reinforcement learning for llm multi-step reasoning.”

  50. [50]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models,

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” 2024, the origin of GRPO algorithm

  51. [51]

    Generative multimodal models are in-context learners,

    Q. Sun, Y . Cui, X. Zhang, F. Zhang, Q. Yu, Z. Luo, Y . Wang, Y . Rao, J. Liu, T. Huang, and X. Wang, “Generative multimodal models are in-context learners,” 2024

  52. [52]

    Omnigen: Unified image generation,

    S. Xiao, Y . Wang, J. Zhou, H. Yuan, X. Xing, R. Yan, C. Li, S. Wang, T. Huang, and Z. Liu, “Omnigen: Unified image generation,” pp. 13 294– 13 304, 2025

  53. [53]

    Ovis: Structural embedding alignment for multimodal large language model, 2024.arXiv preprint arXiv:2405.20797, 2024

    S. Lu, Y . Li, Q.-G. Chen, Z. Xu, W. Luo, K. Zhang, and H.-J. Ye, “Ovis: Structural embedding alignment for multimodal large language model,” arXiv:2405.20797, 2024

  54. [54]

    Emerging Properties in Unified Multimodal Pretraining

    C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, G. Shi, and H. Fan, “Emerging properties in unified multimodal pretraining,”arXiv preprint arXiv:2505.14683, 2025

  55. [55]

    Uni-cot: Towards unified chain-of-thought reasoning across text and vision.arXiv preprint arXiv:2508.05606,

    L. Qin, J. Gong, Y . Sun, T. Li, M. Yang, X. Yang, C. Qu, Z. Tan, and H. Li, “Uni-cot: Towards unified chain-of-thought reasoning across text and vision,”arXiv preprint arXiv:2508.05606, 2025

  56. [56]

    Magicbrush: A manually annotated dataset for instruction-guided image editing,

    K. Zhang, L. Mo, W. Yu, Y . Deng, Y . Xie, and D. Liu, “Magicbrush: A manually annotated dataset for instruction-guided image editing,” in Advances in Neural Information Processing Systems (NeurIPS), 2024

  57. [57]

    Anyedit: Mastering unified high-quality image editing for any idea,

    Q. Yu, W. Chow, Z. Yue, K. Pan, Y . Wu, X. Wan, J. Li, S. Tang, H. Zhang, and Y . Zhuang, “Anyedit: Mastering unified high-quality image editing for any idea,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 26 125–26 135

  58. [58]

    Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

    Q. Cai, J. Chen, Y . Chen, Y . Li, F. Long, Y . Pan, Z. Qiu, Y . Zhang, F. Gao, P. Xuet al., “Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer,”arXiv preprint arXiv:2505.22705, 2025

  59. [59]

    Bytemorph: Benchmarking instruction- guided image editing with non-rigid motions.ArXiv, abs/2506.03107, 2025

    D. Chang, M. Cao, Y . Shi, B. Liu, S. Cai, S. Zhou, W. Huang, G. Wetzstein, M. Soleymani, and P. Wang, “Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions,”arXiv preprint arXiv:2506.03107, 2025

  60. [60]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    C. Wu, P. Zheng, R. Yan, S. Xiao, X. Luo, Y . Wang, W. Li, X. Jiang, Y . Liu, J. Zhou, Z. Liu, Z. Xia, C. Li, H. Deng, J. Wang, K. Luo, B. Zhang, D. Lian, X. Wang, Z. Wang, T. Huang, and Z. Liu, “Omnigen2: Exploration to advanced multimodal generation,”CoRR, vol. abs/2506.18871, 2025

  61. [61]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    B. Lin, Z. Li, X. Cheng, Y . Niu, Y . Ye, X. He, S. Yuan, W. Yu, S. Wang, Y . Ge, Y . Pang, and L. Yuan, “Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation,”CoRR, vol. abs/2506.03147, 2025

  62. [62]

    Qwen3-vl technical report,

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

  63. [63]

Kris-bench: Benchmarking next-level intelligent image editing models,

    Y . Wu, Z. Li, X. Hu, X. Ye, X. Zeng, G. Yu, W. Zhu, B. Schiele, M.-H. Yang, and X. Yang, “Kris-bench: Benchmarking next-level intelligent image editing models,”arXiv preprint arXiv:2505.16707, 2025