arxiv: 2604.27505 · v2 · pith:OGHK4LF5new · submitted 2026-04-30 · 💻 cs.CV

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Pith reviewed 2026-05-21 09:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editing · reinforcement learning · reward model · chain-of-thought · verifier · RLHF · diffusion models · preference optimization

The pith

A chain-of-thought verifier reward model improves reinforcement learning for image editing by scoring each instruction principle separately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the lack of reliable reward models that hold back reinforcement learning for image editing. Current edit reward models issue single overall scores that overlook the varied requirements inside a given instruction and introduce bias. The proposed solution replaces the simple scorer with a reasoning verifier that splits each editing instruction into distinct principles, checks the output image against every principle, and combines the individual results into a fine-grained reward. This verifier is first warmed up with supervised fine-tuning to produce chain-of-thought trajectories and then strengthened with group contrastive preference optimization on human preference data. The resulting reward model is used inside a GRPO training loop to improve downstream editing models such as FLUX.1-kontext.

Core claim

We introduce Edit-RRM, a chain-of-thought verifier-based reasoning reward model that breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. Built first with supervised fine-tuning for CoT trajectories and then refined by Group Contrastive Preference Optimization, the Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model and shows consistent gains when model size scales from 3B to 7B parameters. When the same reward model is used inside Edit-R1 with GRPO, editing models such as FLUX.1-kontext improve.

What carries the argument

Edit-RRM, the reasoning reward model that decomposes an editing instruction into separate principles, verifies the image against each principle individually, and aggregates the per-principle checks into a single fine-grained reward signal.
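
The decompose, verify, aggregate pattern can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: the `PrincipleCheck` type, the `aggregate_reward` function, the [0, 1] score range, the per-principle weights, and the weighted-mean aggregation are all assumptions, since the abstract does not say how the per-principle checks are combined.

```python
from dataclasses import dataclass


@dataclass
class PrincipleCheck:
    """One verifier judgment for a single principle (hypothetical structure)."""
    principle: str        # e.g. "the hat is changed to red"
    score: float          # assumed to lie in [0, 1]
    weight: float = 1.0   # assumed importance weight
    rationale: str = ""   # chain-of-thought text, kept for inspection


def aggregate_reward(checks: list[PrincipleCheck]) -> float:
    """Combine per-principle scores into one scalar.

    A weighted mean is used here purely as an assumption.
    """
    if not checks:
        raise ValueError("at least one principle check is required")
    total_weight = sum(c.weight for c in checks)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(c.score * c.weight for c in checks) / total_weight


# Hypothetical checks for the instruction "make the hat red, keep everything else".
checks = [
    PrincipleCheck("the hat is changed to red", score=1.0),
    PrincipleCheck("the person's face is unchanged", score=0.5),
    PrincipleCheck("the background is preserved", score=0.0),
]
reward = aggregate_reward(checks)
```

Keeping the per-principle records alongside the scalar is what makes the reward inspectable: a low reward can be traced back to the specific principle that failed.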

If this is right

  • The RRM supplies interpretable, principle-level feedback that can be inspected and debugged during training.
  • Reward quality scales upward with model size from 3B to 7B parameters.
  • Editing models trained with the RRM via GRPO obtain measurable gains over the same models trained without it.
  • The verifier approach can be applied to other editing architectures beyond FLUX.1-kontext.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same principle-by-principle verification pattern could be tested on text-to-image generation or video editing tasks where instructions are equally multi-faceted.
  • If the decomposition of instructions into principles can be automated reliably, the method could handle longer and more complex editing requests without extra human annotation.
  • Because the reward model is non-differentiable, future work might explore hybrid training that mixes the verifier reward with differentiable auxiliary losses.
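
A non-differentiable reward is not an obstacle for GRPO because the reward enters only as a scalar per sampled edit, which is then standardized within the group of samples drawn for the same input. The sketch below shows that step under the commonly used GRPO formulation; the function name, the `eps` constant, the choice of population standard deviation, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of samples for the same input.

    Each advantage is (reward - group mean) / (group std + eps). The reward
    model is only called to produce `rewards`; no gradient flows through it.
    Population std is used here; implementations differ on this choice.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifier rewards for four edits sampled from one instruction.
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
```

Edits scored above the group mean receive positive advantages and are reinforced; edits below it are discouraged. When every edit in a group receives the same reward, all advantages are zero and the group contributes no learning signal.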

Load-bearing premise

Splitting editing instructions into distinct principles and scoring the image against each one separately produces generalizable rewards that avoid the biases of single overall scores.

What would settle it

A head-to-head evaluation in which Edit-RRM scores lower than Seed-1.5-VL or Seed-1.6-VL on standard editing benchmarks, or in which GRPO-trained editing models show no improvement, would falsify the central claims.

Original abstract

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Edit-R1, a framework for reinforcement learning in image editing. It proposes Edit-RRM, a chain-of-thought verifier-based reasoning reward model that decomposes editing instructions into distinct principles, evaluates the edited image against each principle, and aggregates the per-principle checks into a fine-grained, interpretable reward. The RRM is trained via supervised fine-tuning on CoT trajectories followed by Group Contrastive Preference Optimization (GCPO) on human pairwise preferences; the resulting reward model is then used with GRPO to optimize editing models such as FLUX.1-kontext. The paper claims that Edit-RRM outperforms general VLMs including Seed-1.5-VL and Seed-1.6-VL, exhibits consistent scaling from 3B to 7B parameters, and delivers measurable gains to downstream editing models.

Significance. If the reported gains are shown to arise specifically from the principle-decomposition verifier rather than domain adaptation alone, the work would offer a concrete advance in reward modeling for image editing by supplying more interpretable and potentially less biased signals than overall-score or general-purpose VLMs. The scaling trend, if reproducible, would further support verifier-based RL approaches in vision tasks.

major comments (2)
  1. [Abstract] The central claim that breaking instructions into distinct principles and performing per-principle evaluation produces meaningfully less biased rewards than overall scoring or general VLMs is load-bearing for the contribution. However, no ablation is described that holds training data, model size, and optimization (SFT+GCPO) fixed while varying only the reward aggregation (principles vs. single overall score or direct CoT). Without this control, the reported superiority over Seed-1.5-VL and Seed-1.6-VL and the scaling trend cannot be attributed to the verifier structure.
  2. [Experiments, implied by abstract claims] The soundness of the outperformance and FLUX.1-kontext gains rests on experimental outcomes, yet the abstract provides no dataset descriptions, evaluation metrics, error bars, or baseline details. This absence prevents verification that the gains are robust and directly traceable to the RRM rather than other factors in the training pipeline.
minor comments (1)
  1. [Abstract] The distinction between the overall framework (Edit-R1) and the reward model (Edit-RRM) could be stated more explicitly when summarizing the contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline planned revisions to strengthen the attribution of results to the verifier structure and to improve clarity of experimental details.

  1. Referee: [Abstract] The central claim that breaking instructions into distinct principles and performing per-principle evaluation produces meaningfully less biased rewards than overall scoring or general VLMs is load-bearing for the contribution. However, no ablation is described that holds training data, model size, and optimization (SFT+GCPO) fixed while varying only the reward aggregation (principles vs. single overall score or direct CoT). Without this control, the reported superiority over Seed-1.5-VL and Seed-1.6-VL and the scaling trend cannot be attributed to the verifier structure.

    Authors: We agree that an explicit ablation isolating the principle-decomposition mechanism would provide stronger causal evidence. Our current comparisons show Edit-RRM outperforming general-purpose VLMs (Seed-1.5-VL, Seed-1.6-VL) that use overall scoring, and the RRM is trained on editing-specific CoT trajectories. However, to directly address the concern that gains may stem from domain adaptation rather than the verifier design, we will add a controlled ablation in the revised manuscript: a variant RRM trained with identical data, model sizes (3B/7B), and SFT+GCPO procedure but using a single overall score instead of per-principle evaluation. Results of this ablation will be reported alongside the existing scaling and downstream editing experiments.

    revision: yes

  2. Referee: [Experiments, implied by abstract claims] The soundness of the outperformance and FLUX.1-kontext gains rests on experimental outcomes, yet the abstract provides no dataset descriptions, evaluation metrics, error bars, or baseline details. This absence prevents verification that the gains are robust and directly traceable to the RRM rather than other factors in the training pipeline.

    Authors: The abstract is intentionally concise to comply with length limits and therefore omits granular experimental details. The full manuscript (Section 4) describes the training datasets for SFT and GCPO (editing instruction pairs with human preferences), evaluation protocols (human pairwise preference accuracy for the RRM plus downstream metrics on editing quality), baselines (including the cited Seed VLMs and additional controls), and results with error bars or standard deviations across runs. The FLUX.1-kontext improvements are measured on standard image-editing benchmarks using GRPO with the trained RRM. To enhance readability, we will expand the abstract with a brief mention of the primary datasets, metrics, and evaluation setting, or add a short experimental summary paragraph.

    revision: partial
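
The controlled ablation and the "human pairwise preference accuracy" protocol mentioned in the responses above both come down to one measurement: how often a pointwise reward model scores the human-preferred edit above the rejected one. A minimal sketch follows, assuming each pair has already been scored by the model under test; the function name, the half-credit rule for ties, and the example scores are illustrative assumptions rather than details from the paper.

```python
def pairwise_accuracy(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of human preference pairs that a pointwise scorer ranks correctly.

    Each element is (score_of_preferred_edit, score_of_rejected_edit).
    Ties earn half credit, which is an assumption of this sketch.
    """
    if not scored_pairs:
        raise ValueError("at least one scored pair is required")
    credit = 0.0
    for preferred, rejected in scored_pairs:
        if preferred > rejected:
            credit += 1.0
        elif preferred == rejected:
            credit += 0.5
    return credit / len(scored_pairs)


# Hypothetical scores from one reward-model variant on four preference pairs.
scored_pairs = [(0.9, 0.4), (0.3, 0.6), (0.7, 0.7), (0.8, 0.1)]
accuracy = pairwise_accuracy(scored_pairs)
```

Running the same function on a per-principle variant and a single-overall-score variant, with data, model size, and training procedure held fixed, would give the like-for-like comparison the referee asks for.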

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains self-contained

The paper defines Edit-RRM as a CoT verifier that decomposes instructions into principles and aggregates per-principle evaluations, then trains it via SFT cold-start followed by the newly introduced GCPO on external human pairwise preference data, and finally applies the resulting non-differentiable reward inside GRPO for editing models. All performance claims (surpassing Seed VLMs, scaling from 3B to 7B, gains on FLUX) are presented as empirical experimental outcomes on downstream editing tasks rather than quantities that reduce by construction to the fitted parameters, the decomposition rule, or any self-citation chain. No load-bearing step equates a reported prediction or uniqueness result to its own inputs; the verifier structure is an independent modeling choice whose contribution is tested rather than presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that human preference data can be used to train a pointwise CoT verifier that generalizes across editing instructions. No free parameters are quantified in the abstract, and the two invented entities listed below come without independent evidence.

axioms (1)
  • domain assumption: Breaking editing instructions into distinct principles yields unbiased per-principle evaluations that aggregate into a reliable overall reward.
    Stated in the abstract as the core motivation for replacing simple scorers with a reasoning verifier.
invented entities (2)
  • Edit-RRM (no independent evidence)
    purpose: Chain-of-thought verifier-based reasoning reward model for image editing
    Newly introduced component that evaluates edits against individual principles.
  • GCPO (no independent evidence)
    purpose: Group Contrastive Preference Optimization algorithm to train the RRM from pairwise data
    New RL algorithm presented for reinforcing the pointwise reward model.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 29 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  2. [2]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

  3. [3]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024

  4. [4]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia Conference Papers, 2024

  5. [5]

    Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv e-prints, pages arXiv–2506, 2025

  6. [6]

    Improving image generation with better captions.OpenAI TechnicalReport, 2023

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.OpenAI TechnicalReport, 2023

  7. [7]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  8. [8]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  9. [9]

    Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

  10. [10]

    Muse: Text-to-image generation via masked generative transformers

    Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. InInternational Conference on Machine Learning, pages 4055–4075. PMLR, 2023

  11. [11]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

  12. [12]

    Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, 2024

  13. [13]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  14. [14]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  15. [15]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-firstinternational conference on machine learning, 2024

  16. [16]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025

  17. [17]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025. 11

  18. [18]

    Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

  19. [19]

    Onereward: Unified mask-guided image generation via multi-task human preference learning.arXiv:2508.21066, 2025

    Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, and Xinglong Wu. Onereward: Unified mask-guided image generation via multi-task human preference learning.arXiv preprint arXiv:2508.21066, 2025

  20. [20]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

  21. [21]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  22. [22]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

  23. [23]

    Real-time identity defenses against malicious personalization of diffusion models.arXiv preprint arXiv:2412.09844, 2024

    Hanzhong Guo, Shen Nie, Chao Du, Tianyu Pang, Hao Sun, and Chongxuan Li. Real-time identity defenses against malicious personalization of diffusion models.arXiv preprint arXiv:2412.09844, 2024

  24. [24]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2025

  25. [25]

    Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

    Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15733–15744, 2025

  26. [26]

    arXiv preprint arXiv:2505.16265

    Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, et al. Think-rm: Enabling long-horizon reasoning in generative reward models.arXiv preprint arXiv:2505.16265, 2025

  27. [27]

    Dreamfuse: Adaptive image fusion with diffusion transformer.arXiv preprint arXiv:2504.08291, 2025

    Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, and Guanbin Li. Dreamfuse: Adaptive image fusion with diffusion transformer.arXiv preprint arXiv:2504.08291, 2025

  28. [28]

    Imagen 3.arXiv preprint arXiv:2408.07009, 2024

    Imagen 3 Team. Imagen 3.arXiv preprint arXiv:2408.07009, 2024

  29. [29]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  30. [30]

    Viescore: Towards explainable metrics for conditional image synthesis evaluation

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 12268–12290, 2024

  31. [31]

    Flux: Official inference repository for flux.1 models, 2024

    Black Forest Labs. Flux: Official inference repository for flux.1 models, 2024. URL https://github.com/ black-forest-labs/flux. Accessed: 2024-11-12

  32. [32]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding.arXiv preprint arXiv:2405.08748, 2024

  33. [33]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  34. [34]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

  35. [35]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprintarXiv:2504.17761, 2025

  36. [36]

    Inference-time scaling for generalist reward modeling,

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling.arXiv preprint arXiv:2504.02495, 2025. 12

  37. [37]

    Editscore: Unlocking online rl for image editing via high-fidelity reward modeling

    Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, et al. Editscore: Unlocking online rl for image editing via high-fidelity reward modeling.arXivpreprintarXiv:2509.23909, 2025

  38. [38]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model.arXiv preprint arXiv:2502.10248, 2025

  39. [39]

    Hpsv3: Towards wide-spectrum human preference score.arXiv preprint arXiv:2508.03789,

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025

  40. [40]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  41. [41]

    Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

  42. [42]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022

  43. [43]

    Byteedit: Boost, comply and accelerate generative image editing

    Yuxi Ren, Jie Wu, Yanzuo Lu, Huafeng Kuang, Xin Xia, Xionghui Wang, Qianqian Wang, Yixing Zhu, Pan Xie, Shiyin Wang, et al. Byteedit: Boost, comply and accelerate generative image editing. InEuropean Conference on Computer Vision, pages 184–200. Springer, 2024

  44. [44]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  45. [45]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

  46. [46]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. InAdvancesin Neural Information Processing Systems, 2022

  47. [47]

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al

    Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model.arXiv preprint arXiv:2504.08685, 2025

  48. [48]

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model.arXiv preprint arXiv:2512.13507, 2025

  49. [49]

    Seedance 2.0: Advancing Video Generation for World Complexity

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

  50. [50]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  51. [51]

    Seededit: Align image re-generation to image editing

    Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing.arXiv preprint arXiv:2411.06686, 2024

  52. [52]

    Movie Gen: A Cast of Media Foundation Models

    The Movie Gen Team. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  53. [53]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  54. [54]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 13

  55. [55]

    Worldpm: Scaling human preference modeling.arXiv preprint arXiv:2505.10527, 2025

    Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, et al. Worldpm: Scaling human preference modeling.arXiv preprint arXiv:2505.10527, 2025

  56. [56]

    Seededit 3.0: Fast and high-quality generative image editing.arXiv preprint arXiv:2506.05083, 2025

    Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing.arXiv preprint arXiv:2506.05083, 2025

  57. [57]

    Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv:2505.03318, 2025

    Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025

  58. [58]

    Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

    Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025

  59. [59]

    Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model

    Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model. arXiv preprint arXiv:2509.04548, 2025

  60. [60]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

  61. [61]

    Rewarddance: Reward scaling in visual generation

    Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. Rewarddance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025

  62. [62]

    Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank

    Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv preprint arXiv:2505.14460, 2025

  63. [63]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025

  64. [64]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  65. [65]

    A unified pairwise framework for rlhf: Bridging generative reward modeling and policy optimization

    Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, and Yonghui Wu. A unified pairwise framework for rlhf: Bridging generative reward modeling and policy optimization. arXiv preprint arXiv:2504.04950, 2025

  66. [66]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

  67. [67]

    Automatic photo adjustment using deep neural networks

    Zhicheng Yan, Hao Zhang, Baoyuan Wang, Sylvain Paris, and Yizhou Yu. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics, 35(2), 2016

  68. [68]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, 2025

  69. [69]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

  70. [70]

    Cogview3: Finer and faster text-to-image generation via relay diffusion

    Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion. In European Conference on Computer Vision, 2024

  71. [71]

    Exemplar-based image and video stylization using fully convolutional semantic features

    Feida Zhu, Zhicheng Yan, Jiajun Bu, and Yizhou Yu. Exemplar-based image and video stylization using fully convolutional semantic features. IEEE Transactions on Image Processing, 26(7):3542–3555, 2017

    A System prompt

    A.1 System Prompt for Decomposing Principles

    In practice, the prompt is used in an in-context learning manner with expert-written decomposi...

  72. [72]

    Instruction Following

    3-4 points for "Instruction Following" (to assess the implementation of the edit)

  73. [73]

    Feature Preservation

    3-4 points for "Feature Preservation" (to assess the retention of original features)

  74. [74]

    Image Quality

    2-3 points for "Image Quality" (to assess the quality of the resulting image).
    ### Output Format:
    A JSON array, where each element contains a 'question' field and a 'category' field.
    ### New Task:
    Instruction: {Edit Instruction}
    Image: <image>
    Please generate all evaluation points ...

  75. [75]

    For each evaluation point provided in the format ‘[{'question': , 'category': }]’, evaluate and score it based on a comparison of the before/after images and the edit instruction, strictly adhering to the scoring standards in the [Rule Definition]

  76. [76]

    0 means completely unusable (e

    Based on the above, assign a final score to the edited image from 0 to 10. 0 means completely unusable (e.g., severe artifacts, very difficult to fix manually). 5 means partially usable (some good aspects but far from ready). 8 means nearly usable (minor artifacts, inconsistencies, or instruction d ...

  77. [77]

    When positional changes are involved, output bounding box coordinates in your thought process to reflect your analysis of the position, and then judge if the edit is valid based on the scale of change defined in the rules

  78. [78]

    average_score

    Finally, assess the difference between the before and after images to confirm that an edit has actually occurred.
    ## Output Format:
    Produce the output in the following sequence: scores for each evaluation point, the average score of the evaluation points, and finally, the reasoned final score for t...

  79. [79]

    Does the generated image change the garage style from modern to Chinese style?: 1

  80. [80]

    Does the generated image contain two sports cars, one white and one black?: 1
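
    The prompt fragments above describe a JSON array of evaluation points (each with a 'question' and a 'category' field), per-category point counts, and an average over the per-point scores. A minimal Python sketch of that bookkeeping follows. The function names, the error handling, and the reading of "3-4 / 3-4 / 2-3 points" as hard limits are assumptions made for illustration, not the paper's implementation. The per-point scoring scale is not fully visible in the excerpt, so the sketch simply averages whatever numeric scores it is given.

```python
import json
from collections import Counter

# Assumed limits on the number of evaluation points per category,
# read off the decomposition prompt excerpt (3-4, 3-4, 2-3).
CATEGORY_LIMITS = {
    "Instruction Following": (3, 4),
    "Feature Preservation": (3, 4),
    "Image Quality": (2, 3),
}


def parse_points(raw_json):
    """Parse a JSON array of evaluation points and check its shape."""
    points = json.loads(raw_json)
    for point in points:
        if "question" not in point or "category" not in point:
            raise ValueError(f"missing field in evaluation point: {point}")
        if point["category"] not in CATEGORY_LIMITS:
            raise ValueError(f"unknown category: {point['category']}")
    counts = Counter(point["category"] for point in points)
    for category, (low, high) in CATEGORY_LIMITS.items():
        if not low <= counts[category] <= high:
            raise ValueError(
                f"{category}: expected {low}-{high} points, got {counts[category]}"
            )
    return points


def average_score(scores):
    """Mean of per-point scores (hypothetical helper for 'average_score')."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores.values()) / len(scores)


# The two example questions shown above, both scored 1.
example_scores = {
    "Does the generated image change the garage style from modern to Chinese style?": 1,
    "Does the generated image contain two sports cars, one white and one black?": 1,
}
print(average_score(example_scores))  # 1.0
```

    The reasoned final score on the 0 to 10 scale is produced by the verifier itself in the prompt, so it is deliberately left out of this sketch.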

Showing first 80 references.