DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing
Pith reviewed 2026-05-07 16:39 UTC · model grok-4.3
The pith
Decoupling the reasoning planner from the image generator, and training the planner alone with separate cognitive and visual rewards, improves performance on complex, reasoning-driven image editing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By keeping the generative Editor fixed and training only the Thinker with dual-atomic reinforcement learning, the method decomposes feedback into a cognitive-atomic reward that judges plan quality and a visual-atomic reward that judges image quality. Both rewards are delivered through verifiable checklists synthesized from the source image, the user instruction, and a rational reference description of the desired post-edit scene.
What carries the argument
A dual-atomic reinforcement learning framework that splits the reward into a cognitive-atomic component assessing the Thinker's executable plan and a visual-atomic component assessing final image quality, both implemented via verifiable checklists.
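To ground the mechanism, a minimal sketch of checklist-based dual-atomic rewards follows. The item structure, the `judge_plan` and `judge_image` verifier callbacks (e.g., a VLM judge answering yes/no questions), and the equal weighting are illustrative assumptions; the paper's exact scoring and combination rule are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    question: str      # an atomic yes/no criterion, e.g. "Is the umbrella now open?"
    weight: float = 1.0

def checklist_score(items, judge):
    """Weighted pass rate over atomic items, judged by a verifier callback."""
    total = sum(item.weight for item in items)
    passed = sum(item.weight for item in items if judge(item.question))
    return passed / total if total else 0.0

def dual_atomic_reward(plan, edited_image, cog_items, vis_items,
                       judge_plan, judge_image, lam_cog=0.5, lam_vis=0.5):
    # Cognitive-atomic reward: checklist applied to the Thinker's executable plan.
    r_cog = checklist_score(cog_items, lambda q: judge_plan(plan, q))
    # Visual-atomic reward: checklist applied to the image the fixed Editor rendered.
    r_vis = checklist_score(vis_items, lambda q: judge_image(edited_image, q))
    return lam_cog * r_cog + lam_vis * r_vis
```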
If this is right
- Substantial gains appear on reasoning-driven benchmarks including RISE-Bench and KRIS-Bench.
- Community models reach performance levels competitive with strong proprietary models.
- The contribution of the planning module can be measured in isolation under a fixed Editor.
- A two-stage curation pipeline produces a difficulty-aware curriculum that supports effective reinforcement learning (one plausible refinement filter is sketched below).
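To make the difficulty-aware refinement concrete, here is a minimal sketch under the assumption that difficulty is estimated from rollout success rates, and that samples the current policy always or never solves carry little reinforcement signal. The `rollout_success_rate` helper and the band thresholds are hypothetical, not the paper's pipeline.

```python
def difficulty_aware_refine(samples, rollout_success_rate,
                            low=0.1, high=0.9, n_rollouts=8):
    """Keep samples in a mid-difficulty band: tasks the policy sometimes
    solves tend to give the strongest group-relative RL signal."""
    kept = []
    for sample in samples:
        # Fraction of rollouts whose edit passes the checklist (assumed helper).
        p = rollout_success_rate(sample, n=n_rollouts)
        if low <= p <= high:
            kept.append((p, sample))
    kept.sort(key=lambda t: -t[0])  # order the curriculum from easier to harder
    return [sample for _, sample in kept]
```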
Where Pith is reading between the lines
- The same decoupled pattern could be tested on video generation or 3D scene editing where planning and rendering are already separate.
- Only the planner would need retraining when new reasoning patterns emerge, leaving the visual generator untouched.
- Checklist-based atomic rewards might transfer to other multimodal tasks where direct human feedback is expensive.
Load-bearing premise
The synthesized rational reference descriptions and resulting checklists supply unbiased and reliable signals for both cognitive plan quality and visual image quality.
What would settle it
Performance gains disappear on a new set of editing tasks when the rational reference descriptions are removed from the checklist synthesis process.
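As a concrete handle on that test, here is a hypothetical sketch of checklist synthesis with the rational reference description as an ablation toggle; the `llm` callable and prompt wording are assumptions, not the paper's actual pipeline.

```python
def synthesize_checklist(llm, source_desc, instruction, use_reference=True):
    """Synthesize atomic verification items, optionally grounded in a
    rational reference description of the ideal post-edit scene."""
    context = f"Source: {source_desc}\nInstruction: {instruction}\n"
    if use_reference:
        reference = llm(f"{context}Describe the ideal scene after this edit.")
        context += f"Reference: {reference}\n"
    prompt = (context
              + "List atomic yes/no checks a correct edit must pass, one per line.")
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]
```

Comparing RL runs trained with `use_reference=True` against `use_reference=False` on held-out tasks would be the decisive test described above.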
Original abstract
Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforcement learning framework. This framework decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward to directly assess the quality of the Thinker's executable plan, which serves as the actionable outcome of the Thinker's reasoning, and a visual-atomic reward to assess the final image quality. To improve checklist quality, our checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse and reasoning-focused dataset, then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance. Our method enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DDA-Thinker, a Thinker-centric framework that decouples optimization of a reasoning/planning module (Thinker) from a fixed generative Editor using dual-atomic reinforcement learning. Feedback is decomposed into a cognitive-atomic reward (assessing plan quality via verifiable checklists) and a visual-atomic reward (assessing final image quality), with checklists synthesized from LLM-generated rational reference descriptions of ideal post-edit scenes. A two-stage data curation pipeline (diverse synthesis followed by difficulty-aware refinement) supports RL training. Experiments on RISE-Bench and KRIS-Bench are claimed to show substantial overall performance gains, enabling community models to compete with proprietary ones.
Significance. If the central results hold after validation of the reward signals, the work would demonstrate that targeted, decoupled optimization of reasoning components can meaningfully advance complex image editing without retraining the base generative model. This offers a practical route to interpretability and incremental gains in reasoning-driven tasks, with potential to narrow the gap between open-source and closed models.
Major comments (3)
- [Method (checklist synthesis and dual-atomic rewards)] The central claim of substantial gains and competitiveness with proprietary models rests on the cognitive-atomic and visual-atomic rewards being faithful proxies for reasoning quality and edit success. However, these rewards derive from checklists synthesized from LLM-generated rational references (via two-stage curation and difficulty-aware refinement); no human validation, inter-annotator agreement, or correlation analysis with independent metrics (e.g., human-rated plan executability or edit fidelity) is provided to rule out synthesis biases or reward hacking. This is load-bearing for the decoupled RL contribution.
- [Experiments] The abstract and claims assert 'substantially improves overall performance' and 'competitive with strong proprietary models' on RISE-Bench and KRIS-Bench, yet no quantitative metrics, baseline tables, statistical significance tests, or ablation results (e.g., Thinker vs. joint optimization, checklist variants) are referenced in the summary material. Without these, the magnitude, reliability, and attribution of gains to the dual-atomic RL cannot be evaluated.
- [Framework overview and Editor-fixed setting] By holding the Editor fixed while optimizing only the Thinker, any mismatch between checklist-rewarded plans and what the Editor can actually render is unaddressed. No analysis of failure cases where high checklist scores yield poor visual outcomes (or vice versa) is given, weakening the claim that the framework isolates and improves reasoning.
Minor comments (2)
- [Method] Notation for the two atomic rewards and their combination into the RL objective could be clarified with explicit equations or pseudocode to aid reproducibility (see the illustrative sketch after this list).
- [Discussion or Conclusion] The paper should include a limitations section discussing potential biases in LLM-synthesized references and the generalizability of the approach beyond the chosen benchmarks.
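As the first minor comment suggests, explicit notation would help. One illustrative formulation, an assumption rather than the paper's own equations, would score each atomic reward as a checklist pass rate and combine them linearly:

```latex
% Illustrative notation only; the paper's actual objective is not reproduced here.
% Checklist-averaged atomic rewards for a sampled plan p and rendered image x:
R_{\mathrm{cog}}(p) = \frac{1}{|C_{\mathrm{cog}}|} \sum_{c \in C_{\mathrm{cog}}}
    \mathbb{1}\!\left[ p \text{ satisfies } c \right], \qquad
R_{\mathrm{vis}}(x) = \frac{1}{|C_{\mathrm{vis}}|} \sum_{c \in C_{\mathrm{vis}}}
    \mathbb{1}\!\left[ x \text{ satisfies } c \right]
% Combined scalar reward driving policy optimization of the Thinker:
R(p, x) = \lambda_{\mathrm{cog}} \, R_{\mathrm{cog}}(p)
        + \lambda_{\mathrm{vis}} \, R_{\mathrm{vis}}(x)
```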
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below with clarifications from the full paper and commit to revisions that strengthen the presentation and validation of our contributions.
Point-by-point responses
-
Referee: [Method (checklist synthesis and dual-atomic rewards)] The central claim of substantial gains and competitiveness with proprietary models rests on the cognitive-atomic and visual-atomic rewards being faithful proxies for reasoning quality and edit success. However, these rewards derive from checklists synthesized from LLM-generated rational references (via two-stage curation and difficulty-aware refinement); no human validation, inter-annotator agreement, or correlation analysis with independent metrics (e.g., human-rated plan executability or edit fidelity) is provided to rule out synthesis biases or reward hacking. This is load-bearing for the decoupled RL contribution.
Authors: We agree that validating the faithfulness of the synthesized checklists and dual-atomic rewards is critical to support the core claims. The manuscript (Sections 3.1–3.2) grounds checklist synthesis in LLM-generated rational reference descriptions of ideal post-edit scenes, combined with the two-stage curation pipeline for diversity and difficulty-aware refinement. However, we acknowledge the absence of human validation, inter-annotator agreement metrics, or explicit correlation analysis with human-rated plan executability and edit fidelity. In the revised manuscript, we will add a dedicated human evaluation study (including agreement scores and Pearson/Spearman correlations with independent human judgments) as a new subsection or appendix to directly address potential synthesis biases and reward hacking concerns. revision: yes
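For concreteness, the promised correlation analysis could look like the following sketch; the reward and rating values are placeholders and `scipy.stats` is assumed available, so this illustrates the statistics rather than reporting any result.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder data: checklist-derived rewards vs. independent human ratings
# for the same set of edits (real values would come from the study).
checklist_reward = np.array([0.9, 0.4, 0.7, 0.2, 0.8, 0.5])
human_rating = np.array([4.5, 2.0, 3.5, 1.5, 4.0, 3.0])

r_p, p_p = pearsonr(checklist_reward, human_rating)
r_s, p_s = spearmanr(checklist_reward, human_rating)
print(f"Pearson r={r_p:.2f} (p={p_p:.3f}), Spearman rho={r_s:.2f} (p={p_s:.3f})")
```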
-
Referee: [Experiments] The abstract and claims assert 'substantially improves overall performance' and 'competitive with strong proprietary models' on RISE-Bench and KRIS-Bench, yet no quantitative metrics, baseline tables, statistical significance tests, or ablation results (e.g., Thinker vs. joint optimization, checklist variants) are referenced in the summary material. Without these, the magnitude, reliability, and attribution of gains to the dual-atomic RL cannot be evaluated.
Authors: The full manuscript contains a detailed Experiments section (Section 4) reporting quantitative metrics on RISE-Bench and KRIS-Bench, baseline comparisons, ablation studies (including Thinker-only vs. joint optimization and checklist variants), and performance breakdowns that support the claims of substantial gains and competitiveness with proprietary models. We will revise the abstract and introduction to explicitly cross-reference these tables, figures, and statistical details so that the magnitude, reliability, and attribution of improvements to the dual-atomic RL framework are immediately clear without requiring readers to locate them in the body. revision: partial
-
Referee: [Framework overview and Editor-fixed setting] By holding the Editor fixed while optimizing only the Thinker, any mismatch between checklist-rewarded plans and what the Editor can actually render is unaddressed. No analysis of failure cases where high checklist scores yield poor visual outcomes (or vice versa) is given, weakening the claim that the framework isolates and improves reasoning.
Authors: The visual-atomic reward explicitly scores the final image quality after the fixed Editor renders the Thinker’s plan, so any mismatch between plan quality and renderability is penalized during RL optimization. This design choice directly ties the decoupled training to observable visual success. We acknowledge that the current version lacks a dedicated failure-case analysis. In the revision, we will add an analysis subsection (with qualitative examples) examining cases of high cognitive-atomic scores paired with low visual-atomic outcomes (and the reverse) to better illustrate how the framework isolates reasoning improvements while accounting for Editor limitations. revision: partial
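A minimal sketch of how the promised failure-case analysis could mine training logs, assuming each episode records its cognitive-atomic and visual-atomic scores (the record schema and gap threshold are hypothetical):

```python
def mine_disagreements(records, gap=0.5):
    """Flag episodes where plan quality and rendered-image quality diverge,
    i.e. the failure mode the referee asks about. Each record is assumed
    to hold per-episode scores under keys "r_cog" and "r_vis"."""
    plan_good_render_bad = [r for r in records if r["r_cog"] - r["r_vis"] >= gap]
    render_good_plan_bad = [r for r in records if r["r_vis"] - r["r_cog"] >= gap]
    return plan_good_render_bad, render_good_plan_bad
```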
Circularity Check
No circularity: external benchmarks and an independent checklist-synthesis pipeline keep the evaluation separate from the training signal.
Full rationale
The paper's core contribution is a decoupled RL training procedure for a Thinker module, with rewards obtained from checklists synthesized via a two-stage curation process grounded in source images, instructions, and LLM-generated rational references. Performance is measured on separate external benchmarks (RISE-Bench, KRIS-Bench) rather than on the training checklists themselves. No equations, fitted parameters, or self-citations are presented that would make the reported gains equivalent to the training inputs by construction. The synthesis step affects only the reward signal used during optimization; it does not redefine or tautologically guarantee the benchmark outcomes. This satisfies the default expectation of a non-circular empirical method paper.
Axiom & Free-Parameter Ledger
Invented entities (3)
- Thinker module: no independent evidence
- cognitive-atomic reward: no independent evidence
- visual-atomic reward: no independent evidence