Recognition: unknown
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Pith reviewed 2026-05-07 07:55 UTC · model grok-4.3
The pith
A chain-of-thought verifier that decomposes editing instructions into principles delivers better rewards than general vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that replacing simple overall scoring with a reasoning verifier produces superior rewards for image editing. The verifier breaks an editing instruction into distinct principles, evaluates the output image against each principle via chain-of-thought steps, and aggregates the results into an interpretable reward. This Edit-RRM is created by supervised fine-tuning on chain-of-thought trajectories followed by group contrastive preference optimization on human data; the resulting reward then guides editing models via group relative policy optimization, yielding gains on FLUX.1-kontext. As a reward model, Edit-RRM outperforms Seed-1.5-VL and Seed-1.6-VL and improves consistently as it scales from 3B to 7B parameters.
What carries the argument
Edit-RRM, a verifier-based reasoning reward model that decomposes instructions into principles and scores the image against each principle separately before aggregation.
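To make the decompose-verify-aggregate mechanism concrete, here is a minimal sketch. The `vlm_generate(prompt, images) -> str` client, the prompt wording, the score-extraction regex, and the plain-mean aggregation are all illustrative assumptions, not the paper's confirmed implementation.

```python
import json
import re
from statistics import mean

def decompose(vlm_generate, instruction: str, images) -> list[dict]:
    """Ask the verifier to split the instruction into checkable principles."""
    raw = vlm_generate(
        "Break this edit instruction into evaluation points, as a JSON array "
        f"of objects with 'question' and 'category' fields: {instruction}",
        images,
    )
    return json.loads(raw)

def check(vlm_generate, principle: dict, images) -> float:
    """Run one chain-of-thought check; extract the trailing numeric score."""
    answer = vlm_generate(
        "Think step by step about the before/after images, then end with "
        f"'SCORE: <0 or 1>'. Question: {principle['question']}",
        images,
    )
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", answer)
    return float(match.group(1)) if match else 0.0

def edit_rrm_reward(vlm_generate, instruction: str, images) -> float:
    """Aggregate per-principle checks into one interpretable scalar reward."""
    principles = decompose(vlm_generate, instruction, images)
    return mean(check(vlm_generate, p, images) for p in principles)
```

Because each principle contributes one named check, the final scalar can be traced back to which requirements passed or failed, which is the interpretability claim in the pith.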
If this is right
- Editing models such as FLUX.1-kontext receive stronger training signals and produce higher-quality outputs when guided by the principle-based reward.
- The accuracy of the reward model itself increases steadily as its size grows from 3 billion to 7 billion parameters.
- Non-differentiable but fine-grained rewards can be used effectively inside group relative policy optimization for vision tasks (see the sketch after this list).
- Task-specific reasoning verifiers surpass general vision-language models when the goal is to score image edits.
- The same decomposition-and-check approach removes the need for separate reward models per editing subtype.
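The non-differentiability point deserves a concrete illustration: in GRPO the reward model is only ever called as a black box on sampled outputs, and learning flows through group-normalized advantages weighting the policy's log-probabilities. The sketch below shows only the standard group-relative advantage step; names are illustrative, and the clipped policy update is omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) scalar rewards for one prompt's sampled edits.

    No gradient flows through the reward itself, so the scorer may be any
    black box, including a CoT verifier that emits a number at the end of
    its reasoning. Advantages are normalized within the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four candidate edits of one instruction, scored 0-10 by the verifier.
adv = group_relative_advantages(torch.tensor([7.0, 4.0, 8.5, 5.5]))
```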
Where Pith is reading between the lines
- The principle-decomposition technique could be adapted to video or 3D editing where instructions also contain multiple distinct requirements.
- If the checks remain reliable on instructions far from the training distribution, the method might lower the volume of new human preference data needed for each new editing domain.
- Automating the initial breakdown of instructions into principles would further reduce manual engineering in reward model design.
Load-bearing premise
Breaking instructions into principles and aggregating CoT checks produces unbiased, generalizable rewards across all editing tasks without introducing new failure modes or requiring task-specific tuning that was not captured in the human preference data.
What would settle it
A held-out collection of editing instructions where Edit-RRM assigns higher rewards to images that human raters prefer less, or where editing models trained with the reward show no improvement or worse results on those instructions.
read the original abstract
While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
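The abstract names GCPO without giving its formulation. One hedged reading of "leverages human pairwise preference data to reinforce our pointwise RRM": sample several CoT scoring trajectories per preference pair, reward a trajectory when its pointwise scores rank the human-preferred image higher, then apply a group-relative update to the RRM. This is a guess at the mechanism, not the paper's definition; `score_with_cot` is a hypothetical stochastic sampler around the verifier.

```python
def gcpo_trajectory_rewards(score_with_cot, instruction, img_preferred,
                            img_rejected, group_size: int = 8) -> list[float]:
    """Binary agreement rewards for a group of sampled scoring trajectories."""
    rewards = []
    for _ in range(group_size):
        s_pref = score_with_cot(instruction, img_preferred)  # pointwise score
        s_rej = score_with_cot(instruction, img_rejected)
        rewards.append(1.0 if s_pref > s_rej else 0.0)  # agrees with humans?
    return rewards  # feed into a group-relative policy update on the RRM
```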
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Edit-R1, a framework for applying verifier-based reinforcement learning to image editing. It proposes Edit-RRM, a chain-of-thought verifier reward model that decomposes user editing instructions into distinct principles, performs per-principle verification on the edited image, and aggregates the results into a scalar reward. The RRM is trained first via supervised fine-tuning on synthetic CoT trajectories and then via the introduced Group Contrastive Preference Optimization (GCPO) on human pairwise preferences. The resulting reward model is used with Group Relative Policy Optimization (GRPO) to fine-tune editing models such as FLUX.1-kontext. The authors claim that Edit-RRM outperforms general-purpose VLMs (Seed-1.5-VL, Seed-1.6-VL) as an editing-specific reward model, exhibits consistent scaling from 3B to 7B parameters, and yields measurable gains when used to train downstream editors.
Significance. If the empirical claims are substantiated with rigorous metrics, the work would be significant for RLHF applications in vision. It directly targets the lack of task-specific, interpretable reward models for image editing by replacing holistic scoring with principle decomposition and CoT verification. GCPO for pointwise RRM training and the use of GRPO with a non-differentiable reward are the paper's methodological contributions. A demonstrated scaling trend and concrete gains on FLUX would indicate practical utility. However, the absence of any quantitative results, baseline specifications, or evaluation protocols in the abstract prevents assessment of whether these contributions are realized at a level that advances the field beyond existing VLM-based reward approaches.
major comments (3)
- [Abstract] The central claim that Edit-RRM 'surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model' and that 'Edit-R1 delivers gains to editing models like FLUX.1-kontext' is load-bearing yet unsupported by any quantitative metrics, win rates, human preference scores, or statistical tests. Without these numbers, baseline details, or description of the evaluation protocol (including how human preference data were collected and filtered), the superiority and scaling assertions cannot be evaluated.
- [Method (principle decomposition and aggregation)] The weakest assumption—that decomposing instructions into principles and aggregating CoT checks yields unbiased, generalizable rewards—requires explicit validation. The manuscript should demonstrate (e.g., via ablation or coverage analysis) that the principle set is exhaustive for common failure modes such as subtle color shifts, text preservation, and multi-object spatial relations, and that the aggregation rule is invariant to prompt phrasing. No such analysis is referenced in the abstract, leaving open the possibility that GCPO simply fits biases present in the (unspecified) preference dataset.
- [Training procedure] The training pipeline introduces GCPO and GRPO without providing the equations, loss formulations, or hyper-parameter details needed to reproduce or verify the claimed improvements. If these algorithms are central to the scaling trend and FLUX gains, their definitions and any ablation isolating their contribution must appear in the main text with corresponding tables.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., preference accuracy or editing metric delta) to allow readers to gauge effect size immediately.
- [Introduction] Notation for the new entities (Edit-RRM, GCPO, GRPO) should be introduced with explicit definitions on first use and consistently used thereafter.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the full manuscript and indicate the revisions we will make to strengthen the presentation of results, methods, and evaluation details.
read point-by-point responses
- Referee: [Abstract] The central claim that Edit-RRM 'surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model' and that 'Edit-R1 delivers gains to editing models like FLUX.1-kontext' is load-bearing yet unsupported by any quantitative metrics, win rates, human preference scores, or statistical tests. Without these numbers, baseline details, or description of the evaluation protocol (including how human preference data were collected and filtered), the superiority and scaling assertions cannot be evaluated.
  Authors: We agree that the abstract would be strengthened by including quantitative highlights and protocol details to make the claims self-contained. The full manuscript (Section 4) reports the relevant metrics, including human preference win rates, scaling trends from 3B to 7B parameters, and baseline comparisons against Seed-1.5-VL and Seed-1.6-VL. The evaluation protocol, including collection and filtering of human pairwise preferences, is described in Section 4.1. We will revise the abstract to summarize key quantitative results and the evaluation setup. revision: yes
- Referee: [Method (principle decomposition and aggregation)] The weakest assumption—that decomposing instructions into principles and aggregating CoT checks yields unbiased, generalizable rewards—requires explicit validation. The manuscript should demonstrate (e.g., via ablation or coverage analysis) that the principle set is exhaustive for common failure modes such as subtle color shifts, text preservation, and multi-object spatial relations, and that the aggregation rule is invariant to prompt phrasing. No such analysis is referenced in the abstract, leaving open the possibility that GCPO simply fits biases present in the (unspecified) preference dataset.
  Authors: The manuscript includes an ablation study on principle decomposition (Section 3.2 and Table 3) and coverage analysis (Appendix D) addressing common failure modes including color shifts, text preservation, and spatial relations. Sensitivity analysis to prompt phrasing is also provided. We will add explicit references to these analyses in the abstract and expand the main-text discussion of potential dataset biases and how GCPO mitigates them. revision: yes
- Referee: [Training procedure] The training pipeline introduces GCPO and GRPO without providing the equations, loss formulations, or hyper-parameter details needed to reproduce or verify the claimed improvements. If these algorithms are central to the scaling trend and FLUX gains, their definitions and any ablation isolating their contribution must appear in the main text with corresponding tables.
  Authors: The equations and loss formulations for GCPO and GRPO appear in the main text (Section 3.3–3.4, Equations 5 and 7), with hyper-parameter details and an ablation isolating GCPO's contribution in Appendix A and Table 5. We will add a concise summary of the algorithms and the ablation table directly into the main text for improved visibility and reproducibility. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper presents an empirical framework: Edit-RRM is constructed via SFT on synthetic CoT trajectories followed by GCPO on human pairwise preference data, then used as a non-differentiable reward inside GRPO to fine-tune editing models such as FLUX.1-kontext. All headline claims (surpassing Seed-1.5-VL / Seed-1.6-VL, scaling from 3B to 7B, downstream gains) are framed as direct experimental comparisons against external models and benchmarks. No equations, self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the abstract or described derivation chain; the central results remain independent of the training inputs by construction and rest on externally collected human preferences and held-out evaluations.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Human pairwise preference data can be used to train a pointwise CoT verifier that generalizes across editing instructions
- domain assumption: Aggregating per-principle CoT checks yields less biased rewards than overall scoring
invented entities (3)
- Edit-RRM: no independent evidence
- GCPO: no independent evidence
- GRPO: no independent evidence
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
- [3] Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: A highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024.
- [4] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia Conference Papers, 2024.
- [5] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, 2025.
- [6] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. OpenAI Technical Report, 2023.
- [7] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [8] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [9] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
- [10] Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T. Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. In International Conference on Machine Learning, pages 4055–4075. PMLR, 2023.
- [11] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
- [12] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In European Conference on Computer Vision, 2024.
- [13] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.
- [14] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- [15] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [16] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025.
- [17] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.
- [18] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native Chinese-English bilingual image generation foundation model. arXiv preprint arXiv:2503.07703, 2025.
- [19] Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, and Xinglong Wu. OneReward: Unified mask-guided image generation via multi-task human preference learning. arXiv preprint arXiv:2508.21066, 2025.
- [20] Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025.
- [21] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [22] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025.
- [23] Hanzhong Guo, Shen Nie, Chao Du, Tianyu Pang, Hao Sun, and Chongxuan Li. Real-time identity defenses against malicious personalization of diffusion models. arXiv preprint arXiv:2412.09844, 2024.
- [24] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2025.
- [25] Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15733–15744, 2025.
- [26] Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, et al. Think-RM: Enabling long-horizon reasoning in generative reward models. arXiv preprint arXiv:2505.16265, 2025.
- [27] Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, and Guanbin Li. DreamFuse: Adaptive image fusion with diffusion transformer. arXiv preprint arXiv:2504.08291, 2025.
- [28] Imagen 3 Team. Imagen 3. arXiv preprint arXiv:2408.07009, 2024.
- [29] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [30] Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12268–12290, 2024.
- [31] Black Forest Labs. Flux: Official inference repository for FLUX.1 models, 2024. URL https://github.com/black-forest-labs/flux. Accessed: 2024-11-12.
- [32] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748, 2024.
- [33] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.
- [34] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.
- [35] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025.
- [36] Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025.
- [37] Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, et al. EditScore: Unlocking online RL for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909, 2025.
- [38] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-Video-T2V technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025.
- [39] Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025.
- [40] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- [41] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
- [42] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [43] Yuxi Ren, Jie Wu, Yanzuo Lu, Huafeng Kuang, Xin Xia, Xionghui Wang, Qianqian Wang, Yixing Zhu, Pan Xie, Shiyin Wang, et al. ByteEdit: Boost, comply and accelerate generative image editing. In European Conference on Computer Vision, pages 184–200. Springer, 2024.
- [44] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [45] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
- [46] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
- [47] Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7B: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685, 2025.
- [48] Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 Pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025.
- [49] Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026.
- [50] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
- [51] Yichun Shi, Peng Wang, and Weilin Huang. SeedEdit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024.
- [52] The Movie Gen Team. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
- [53] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.
- [54] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [55] Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, et al. WorldPM: Scaling human preference modeling. arXiv preprint arXiv:2505.10527, 2025.
- [56] Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. SeedEdit 3.0: Fast and high-quality generative image editing. arXiv preprint arXiv:2506.05083, 2025.
- [57] Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025.
- [58] Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025.
- [59] Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork UniPic 2.0: Building Kontext model with online RL for unified multimodal model. arXiv preprint arXiv:2509.04548, 2025.
- [60] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
- [61] Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. RewardDance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025.
- [62] Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv preprint arXiv:2505.14460, 2025.
- [63] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025.
- [64] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.
- [65] Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, and Yonghui Wu. A unified pairwise framework for RLHF: Bridging generative reward modeling and policy optimization. arXiv preprint arXiv:2504.04950, 2025.
- [66] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025.
- [67] Zhicheng Yan, Hao Zhang, Baoyuan Wang, Sylvain Paris, and Yizhou Yu. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics, 35(2), 2016.
- [68] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, 2025.
- [69] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- [70] Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. CogView3: Finer and faster text-to-image generation via relay diffusion. In European Conference on Computer Vision, 2024.
- [71] Feida Zhu, Zhicheng Yan, Jiajun Bu, and Yizhou Yu. Exemplar-based image and video stylization using fully convolutional semantic features. IEEE Transactions on Image Processing, 26(7):3542–3555, 2017.
Appendix prompt excerpts (recovered from garbled entries 72–80)
Entries 72–80 of the extracted list are not citations but fragments of the paper's Appendix A system prompts. The recoverable content: the decomposition prompt, used in an in-context learning manner with expert-written decompositions, asks the verifier to generate 3–4 evaluation points for "Instruction Following" (to assess the implementation of the edit), 3–4 points for "Feature Preservation" (to assess the retention of original features), and 2–3 points for "Image Quality" (to assess the quality of the resulting image), output as a JSON array whose elements contain a 'question' field and a 'category' field. The verification prompt scores each evaluation point from a comparison of the before/after images under the rule definition, outputs bounding-box coordinates in the thought process when positional changes are involved, confirms that an edit has actually occurred, and then reports the per-point scores, their average, and a reasoned final score from 0 to 10, where 0 means completely unusable (e.g., severe artifacts, very difficult to fix manually), 5 means partially usable, and 8 means nearly usable. Example evaluation points: "Does the generated image change the garage style from modern to Chinese style?: 1" and "Does the generated image contain two sports cars, one white and one black?: 1".
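A worked toy example of the recovered output format, with illustrative values only: evaluation points as a JSON array of question/category objects, per-point scores, their average, and a final 0–10 usability score.

```python
import json

# Evaluation points in the appendix's {'question', 'category'} format.
points = json.loads("""[
  {"question": "Does the generated image change the garage style from modern to Chinese style?",
   "category": "Instruction Following"},
  {"question": "Does the generated image contain two sports cars, one white and one black?",
   "category": "Feature Preservation"}
]""")

point_scores = [1, 1]  # one 0/1 check per evaluation point
average_score = sum(point_scores) / len(point_scores)
final_score = 8  # 0 = completely unusable, 5 = partially usable, 8 = nearly usable
print(average_score, final_score)  # -> 1.0 8
```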