Leveraging Verifier-Based Reinforcement Learning in Image Editing
Pith reviewed 2026-05-21 09:06 UTC · model grok-4.3
The pith
A chain-of-thought verifier reward model improves reinforcement learning for image editing by scoring each instruction principle separately.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Edit-RRM, a chain-of-thought verifier-based reasoning reward model that breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. Trained first with supervised fine-tuning on CoT trajectories and then refined by Group Contrastive Preference Optimization, Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model and shows consistent gains when model size scales from 3B to 7B parameters. When the same reward model is used inside Edit-R1 with GRPO, editing models such as FLUX.1-kontext improve.
What carries the argument
Edit-RRM, the reasoning reward model that decomposes an editing instruction into separate principles, verifies the image against each principle individually, and aggregates the per-principle checks into a single fine-grained reward signal.
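A minimal sketch of the decompose, verify, aggregate pattern described above. The `PrincipleCheck` type, the example principles, and the choice of "fraction of principles passed" as the aggregation rule are illustrative assumptions; the source does not state how Edit-RRM combines its per-principle checks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrincipleCheck:
    """One per-principle verdict (hypothetical structure, not from the paper)."""
    principle: str   # a single requirement extracted from the instruction
    passed: bool     # the verifier's verdict for this principle
    rationale: str   # short explanation kept so the reward can be inspected


def aggregate_reward(checks: List[PrincipleCheck]) -> float:
    """Assumed aggregation rule: the fraction of principles that passed."""
    if not checks:
        raise ValueError("at least one principle is required")
    return sum(c.passed for c in checks) / len(checks)


# Hypothetical verifier output for the instruction
# "make the sky sunset orange, keep the house unchanged".
checks = [
    PrincipleCheck("sky is sunset orange", True, "sky hue shifted to orange"),
    PrincipleCheck("house is unchanged", False, "roof colour also changed"),
]
reward = aggregate_reward(checks)
```

Because every verdict carries its own rationale, a low reward can be traced to the specific principle that failed, which is the interpretability property the review highlights.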
If this is right
- The RRM supplies interpretable, principle-level feedback that can be inspected and debugged during training.
- Reward quality scales upward with model size from 3B to 7B parameters.
- Editing models trained with the RRM via GRPO obtain measurable gains over the same models trained without it.
- The verifier approach can be applied to other editing architectures beyond FLUX.1-kontext.
Where Pith is reading between the lines
- The same principle-by-principle verification pattern could be tested on text-to-image generation or video editing tasks where instructions are equally multi-faceted.
- If the decomposition of instructions into principles can be automated reliably, the method could handle longer and more complex editing requests without extra human annotation.
- Because the reward model is non-differentiable, future work might explore hybrid training that mixes the verifier reward with differentiable auxiliary losses.
Load-bearing premise
Splitting editing instructions into distinct principles and scoring the image against each one separately produces unbiased and generalizable rewards that avoid the biases of single overall scores.
What would settle it
A head-to-head evaluation in which Edit-RRM scores lower than Seed-1.5-VL or Seed-1.6-VL on standard editing benchmarks, or in which GRPO-trained editing models show no improvement, would falsify the central claims.
Original abstract
While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
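The abstract's point that GRPO can work with a non-differentiable reward can be illustrated with the group-relative advantage that GRPO-style methods compute: several edits are sampled for the same input, each is scored, and the scores are standardized within the group. The sketch below is a hedged illustration, not the paper's training code; the function name, the use of the population standard deviation, and the `eps` constant are assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group.

    Only the scalar reward values are used, so the reward model itself
    never needs to be differentiated.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate edits of the same image and instruction.
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Edits scored above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged.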
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Edit-R1, a framework for reinforcement learning in image editing. It proposes Edit-RRM, a chain-of-thought verifier-based reasoning reward model that decomposes editing instructions into distinct principles, evaluates the edited image against each principle, and aggregates the per-principle checks into a fine-grained, interpretable reward. The RRM is trained via supervised fine-tuning on CoT trajectories followed by Group Contrastive Preference Optimization (GCPO) on human pairwise preferences; the resulting reward model is then used with GRPO to optimize editing models such as FLUX.1-kontext. The paper claims that Edit-RRM outperforms general VLMs including Seed-1.5-VL and Seed-1.6-VL, exhibits consistent scaling from 3B to 7B parameters, and delivers measurable gains to downstream editing models.
Significance. If the reported gains are shown to arise specifically from the principle-decomposition verifier rather than domain adaptation alone, the work would offer a concrete advance in reward modeling for image editing by supplying more interpretable and potentially less biased signals than overall-score or general-purpose VLMs. The scaling trend, if reproducible, would further support verifier-based RL approaches in vision tasks.
major comments (2)
- [Abstract] The central claim that breaking instructions into distinct principles and performing per-principle evaluation produces meaningfully less biased rewards than overall scoring or general VLMs is load-bearing for the contribution. However, no ablation is described that holds training data, model size, and optimization (SFT+GCPO) fixed while varying only the reward aggregation (principles vs. single overall score or direct CoT). Without this control, the reported superiority over Seed-1.5-VL and Seed-1.6-VL and the scaling trend cannot be attributed to the verifier structure.
- [Experiments] (implied by abstract claims) The soundness of the outperformance and FLUX.1-kontext gains rests on experimental outcomes, yet the abstract provides no dataset descriptions, evaluation metrics, error bars, or baseline details. This absence prevents verification that the gains are robust and directly traceable to the RRM rather than other factors in the training pipeline.
minor comments (1)
- [Abstract] The distinction between the overall framework (Edit-R1) and the reward model (Edit-RRM) could be stated more explicitly when summarizing the contributions.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline planned revisions to strengthen the attribution of results to the verifier structure and to improve clarity of experimental details.
Point-by-point responses
- Referee: [Abstract] The central claim that breaking instructions into distinct principles and performing per-principle evaluation produces meaningfully less biased rewards than overall scoring or general VLMs is load-bearing for the contribution. However, no ablation is described that holds training data, model size, and optimization (SFT+GCPO) fixed while varying only the reward aggregation (principles vs. single overall score or direct CoT). Without this control, the reported superiority over Seed-1.5-VL and Seed-1.6-VL and the scaling trend cannot be attributed to the verifier structure.
  Authors: We agree that an explicit ablation isolating the principle-decomposition mechanism would provide stronger causal evidence. Our current comparisons show Edit-RRM outperforming general-purpose VLMs (Seed-1.5-VL, Seed-1.6-VL) that use overall scoring, and the RRM is trained on editing-specific CoT trajectories. However, to directly address the concern that gains may stem from domain adaptation rather than the verifier design, we will add a controlled ablation in the revised manuscript: a variant RRM trained with identical data, model sizes (3B/7B), and SFT+GCPO procedure but using a single overall score instead of per-principle evaluation. Results of this ablation will be reported alongside the existing scaling and downstream editing experiments.
  revision: yes
- Referee: [Experiments] (implied by abstract claims) The soundness of the outperformance and FLUX.1-kontext gains rests on experimental outcomes, yet the abstract provides no dataset descriptions, evaluation metrics, error bars, or baseline details. This absence prevents verification that the gains are robust and directly traceable to the RRM rather than other factors in the training pipeline.
  Authors: The abstract is intentionally concise to comply with length limits and therefore omits granular experimental details. The full manuscript (Section 4) describes the training datasets for SFT and GCPO (editing instruction pairs with human preferences), evaluation protocols (human pairwise preference accuracy for the RRM plus downstream metrics on editing quality), baselines (including the cited Seed VLMs and additional controls), and results with error bars or standard deviations across runs. The FLUX.1-kontext improvements are measured on standard image-editing benchmarks using GRPO with the trained RRM. To enhance readability, we will expand the abstract with a brief mention of the primary datasets, metrics, and evaluation setting, or add a short experimental summary paragraph.
  revision: partial
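The rebuttal names human pairwise preference accuracy as the RRM's evaluation protocol. One plausible reading, sketched below, is that a pointwise reward model is counted as correct on a pair when it scores the human-preferred edit strictly higher than the rejected one. The pair format, the toy scores, and the strict-inequality tie handling are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# (preferred_edit_id, rejected_edit_id); a hypothetical annotation format.
Pair = Tuple[str, str]


def pairwise_accuracy(pairs: List[Pair], score: Callable[[str], float]) -> float:
    """Fraction of pairs where the preferred edit gets the strictly higher score."""
    if not pairs:
        raise ValueError("at least one preference pair is required")
    correct = sum(score(preferred) > score(rejected) for preferred, rejected in pairs)
    return correct / len(pairs)


# Toy pointwise scores standing in for reward-model outputs.
toy_scores = {"a": 0.9, "b": 0.4, "c": 0.2, "d": 0.7}
pairs = [("a", "b"), ("c", "d"), ("d", "b"), ("a", "c")]
accuracy = pairwise_accuracy(pairs, toy_scores.__getitem__)
```

Under this reading, error bars can be obtained by repeating the computation across runs or resampled pair sets, which is the kind of detail the referee asks to see.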
Circularity Check
No significant circularity; derivation chain remains self-contained
full rationale
The paper defines Edit-RRM as a CoT verifier that decomposes instructions into principles and aggregates per-principle evaluations, then trains it via SFT cold-start followed by the newly introduced GCPO on external human pairwise preference data, and finally applies the resulting non-differentiable reward inside GRPO for editing models. All performance claims (surpassing Seed VLMs, scaling from 3B to 7B, gains on FLUX) are presented as empirical experimental outcomes on downstream editing tasks rather than quantities that reduce by construction to the fitted parameters, the decomposition rule, or any self-citation chain. No load-bearing step equates a reported prediction or uniqueness result to its own inputs; the verifier structure is an independent modeling choice whose contribution is tested rather than presupposed.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Breaking editing instructions into distinct principles yields unbiased per-principle evaluations that aggregate into a reliable overall reward.
invented entities (2)
- Edit-RRM: no independent evidence
- GCPO: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models
Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024
-
[4]
Lumiere: A space-time diffusion model for video generation
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia Conference Papers, 2024
work page 2024
-
[5]
Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv e-prints, pages arXiv–2506, 2025
work page 2025
-
[6]
Improving image generation with better captions.OpenAI TechnicalReport, 2023
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.OpenAI TechnicalReport, 2023
work page 2023
-
[7]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023
work page 2023
-
[9]
Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024
work page 2024
-
[10]
Muse: Text-to-image generation via masked generative transformers
Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. InInternational Conference on Machine Learning, pages 4055–4075. PMLR, 2023
work page 2023
-
[11]
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, 2024
work page 2024
-
[13]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-firstinternational conference on machine learning, 2024
work page 2024
-
[16]
Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025. 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, and Xinglong Wu. Onereward: Unified mask-guided image generation via multi-task human preference learning.arXiv preprint arXiv:2508.21066, 2025
-
[20]
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Hanzhong Guo, Shen Nie, Chao Du, Tianyu Pang, Hao Sun, and Chongxuan Li. Real-time identity defenses against malicious personalization of diffusion models.arXiv preprint arXiv:2412.09844, 2024
-
[24]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15733–15744, 2025
work page 2025
-
[26]
arXiv preprint arXiv:2505.16265
Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, et al. Think-rm: Enabling long-horizon reasoning in generative reward models.arXiv preprint arXiv:2505.16265, 2025
-
[27]
Dreamfuse: Adaptive image fusion with diffusion transformer.arXiv preprint arXiv:2504.08291, 2025
Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, and Guanbin Li. Dreamfuse: Adaptive image fusion with diffusion transformer.arXiv preprint arXiv:2504.08291, 2025
-
[28]
Imagen 3.arXiv preprint arXiv:2408.07009, 2024
Imagen 3 Team. Imagen 3.arXiv preprint arXiv:2408.07009, 2024
-
[29]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Viescore: Towards explainable metrics for conditional image synthesis evaluation
Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 12268–12290, 2024
work page 2024
-
[31]
Flux: Official inference repository for flux.1 models, 2024
Black Forest Labs. Flux: Official inference repository for flux.1 models, 2024. URL https://github.com/ black-forest-labs/flux. Accessed: 2024-11-12
work page 2024
-
[32]
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding.arXiv preprint arXiv:2405.08748, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Improving Video Generation with Human Feedback
Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprintarXiv:2504.17761, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Inference-time scaling for generalist reward modeling,
Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling.arXiv preprint arXiv:2504.02495, 2025. 12
-
[37]
Editscore: Unlocking online rl for image editing via high-fidelity reward modeling
Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, et al. Editscore: Unlocking online rl for image editing via high-fidelity reward modeling.arXivpreprintarXiv:2509.23909, 2025
-
[38]
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model.arXiv preprint arXiv:2502.10248, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Hpsv3: Towards wide-spectrum human preference score.arXiv preprint arXiv:2508.03789,
Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025
-
[40]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023
work page 2023
-
[42]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Byteedit: Boost, comply and accelerate generative image editing
Yuxi Ren, Jie Wu, Yanzuo Lu, Huafeng Kuang, Xin Xia, Xionghui Wang, Qianqian Wang, Yixing Zhu, Pan Xie, Shiyin Wang, et al. Byteedit: Boost, comply and accelerate generative image editing. InEuropean Conference on Computer Vision, pages 184–200. Springer, 2024
work page 2024
-
[44]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
work page 2022
-
[45]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023
work page 2023
-
[46]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. InAdvancesin Neural Information Processing Systems, 2022
work page 2022
-
[47]
Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model.arXiv preprint arXiv:2504.08685, 2025
-
[48]
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model.arXiv preprint arXiv:2512.13507, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
Seedance 2.0: Advancing Video Generation for World Complexity
Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[50]
Emu edit: Precise image editing via recognition and generation tasks
Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024
work page 2024
-
[51]
Seededit: Align image re-generation to image editing
Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing.arXiv preprint arXiv:2411.06686, 2024
-
[52]
Movie Gen: A Cast of Media Foundation Models
The Movie Gen Team. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
Diffusion model alignment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024
work page 2024
-
[54]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Worldpm: Scaling human preference modeling.arXiv preprint arXiv:2505.10527, 2025
Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, et al. Worldpm: Scaling human preference modeling.arXiv preprint arXiv:2505.10527, 2025
-
[56]
Seededit 3.0: Fast and high-quality generative image editing.arXiv preprint arXiv:2506.05083, 2025
Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing.arXiv preprint arXiv:2506.05083, 2025
-
[57]
Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025
-
[58]
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model
Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model. arXiv preprint arXiv:2509.04548, 2025
-
[60]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[61]
Rewarddance: Reward scaling in visual generation
Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. Rewarddance: Reward scaling in visual generation.arXiv preprint arXiv:2509.08826, 2025
-
[62]
Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025
-
[63]
Omnigen: Unified image generation
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025
work page 2025
-
[64]
Imagereward: Learning and evaluating human preferences for text-to-image generation
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023
work page 2023
-
[65]
Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, and Yonghui Wu. A unified pairwise framework for rlhf: Bridging generative reward modeling and policy optimization.arXiv preprint arXiv:2504.04950, 2025
-
[66]
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025
-
[67]
Zhicheng Yan, Hao Zhang, Baoyuan Wang, Sylvain Paris, and Yizhou Yu. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics, 35(2), 2016
-
[68]
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, 2025
-
[69]
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023
-
[70]
Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion. In European Conference on Computer Vision, 2024
-
[71]
Feida Zhu, Zhicheng Yan, Jiajun Bu, and Yizhou Yu. Exemplar-based image and video stylization using fully convolutional semantic features. IEEE Transactions on Image Processing, 26(7):3542–3555, 2017.

A System prompt

A.1 System Prompt for Decomposing Principles

In practice, the prompt is used in an in-context learning manner with expert-written decomposi...
-
[72]
3-4 points for "Instruction Following" (to assess the implementation of the edit)
-
[73]
3-4 points for "Feature Preservation" (to assess the retention of original features)
-
[74]
2-3 points for "Image Quality" (to assess the quality of the resulting image).
### Output Format:
A JSON array, where each element contains a 'question' field and a 'category' field.
### New Task:
Instruction: {Edit Instruction}
Image: <image>
Please generate all evaluation points ...
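As a rough illustration of the output format requested by the decomposition prompt, the Python sketch below checks that a response is a JSON array whose elements carry 'question' and 'category' fields, and that the number of points per category falls within the ranges the prompt states. The function name `validate_evaluation_points`, the assumption that category strings match the prompt's labels exactly, and the categories assigned to the sample questions are all hypothetical and not taken from the paper's code.

```python
import json
from collections import Counter

# Allowed number of evaluation points per category, as stated in the prompt.
# Assumption: the model emits these exact category strings.
CATEGORY_LIMITS = {
    "Instruction Following": (3, 4),
    "Feature Preservation": (3, 4),
    "Image Quality": (2, 3),
}


def validate_evaluation_points(raw_json):
    """Return a list of problems found; an empty list means the output is valid."""
    points = json.loads(raw_json)
    if not isinstance(points, list):
        return ["top-level value is not a JSON array"]

    problems = []
    counts = Counter()
    for i, point in enumerate(points):
        # Every element needs both fields named in the prompt.
        if not isinstance(point, dict) or "question" not in point or "category" not in point:
            problems.append(f"element {i} lacks a 'question' or 'category' field")
            continue
        if point["category"] not in CATEGORY_LIMITS:
            problems.append(f"element {i} has unknown category {point['category']!r}")
            continue
        counts[point["category"]] += 1

    # Check the per-category count ranges (3-4, 3-4, 2-3).
    for category, (low, high) in CATEGORY_LIMITS.items():
        if not low <= counts[category] <= high:
            problems.append(
                f"{category}: expected {low}-{high} points, got {counts[category]}"
            )
    return problems


# Hypothetical, deliberately incomplete response: only two points are given,
# so the validator reports the missing counts.
sample = json.dumps([
    {"question": "Does the generated image change the garage style from modern to Chinese style?",
     "category": "Instruction Following"},
    {"question": "Does the generated image contain two sports cars, one white and one black?",
     "category": "Instruction Following"},
])
for problem in validate_evaluation_points(sample):
    print(problem)
```

A check of this kind could be used to filter malformed decompositions before they reach the scoring prompt; whether the authors apply such a filter is not stated in the text.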
-
[75]
For each evaluation point provided in the format `[{'question': , 'category': }]`, evaluate and score it based on a comparison of the before/after images and the edit instruction, strictly adhering to the scoring standards in the [Rule Definition]
-
[76]
Based on the above, assign a final score to the edited image from 0 to 10. 0 means completely unusable (e.g., severe artifacts, very difficult to fix manually). 5 means partially usable (some good aspects but far from ready). 8 means nearly usable (minor artifacts, inconsistencies, or instruction d ...
-
[77]
When positional changes are involved, output bounding box coordinates in your thought process to reflect your analysis of the position, and then judge if the edit is valid based on the scale of change defined in the rules
-
[78]
Finally, assess the difference between the before and after images to confirm that an edit has actually occurred.
## Output Format:
Produce the output in the following sequence: scores for each evaluation point, the average score of the evaluation points, and finally, the reasoned final score for t...
-
[79]
Does the generated image change the garage style from modern to Chinese style?: 1
-
[80]
Does the generated image contain two sports cars, one white and one black?: 1
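The per-point lines above follow a "question: score" pattern, and the scoring prompt asks for the average of those scores before the final score. The Python sketch below parses such lines and computes that average. It is only an illustration under stated assumptions: the helper names `parse_point_scores` and `average_score` are hypothetical, the score scale is whatever the (truncated) rule definition specifies, and the final 0 to 10 score is reasoned by the reward model itself rather than derived from this average.

```python
def parse_point_scores(text):
    """Extract (question, score) pairs from lines shaped like 'question?: 1'."""
    scores = []
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the last colon so colons inside the question are kept.
        question, _, value = line.rpartition(":")
        try:
            scores.append((question.strip(), float(value)))
        except ValueError:
            # Not a score line (for example a heading); skip it.
            continue
    return scores


def average_score(scores):
    """Mean of the per-point scores; raises if no scores were parsed."""
    if not scores:
        raise ValueError("no evaluation-point scores found")
    return sum(score for _, score in scores) / len(scores)


# The two example lines from the listing above.
response = (
    "Does the generated image change the garage style from modern to Chinese style?: 1\n"
    "Does the generated image contain two sports cars, one white and one black?: 1\n"
)
points = parse_point_scores(response)
print(average_score(points))
```

Keeping the per-point scores alongside the average preserves the principle-level feedback, which is what makes the reward inspectable.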