Recognition: unknown
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Pith reviewed 2026-05-07 07:55 UTC · model grok-4.3
The pith
A chain-of-thought verifier that decomposes editing instructions into principles delivers better rewards than general vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that replacing simple overall scoring with a reasoning verifier produces superior rewards for image editing. The verifier breaks an editing instruction into distinct principles, evaluates the output image against each principle via chain-of-thought steps, and aggregates the results into an interpretable reward. This Edit-RRM is created by supervised fine-tuning on chain-of-thought trajectories followed by group contrastive preference optimization on human data; the resulting reward then guides editing models via group relative policy optimization, yielding gains on FLUX.1-kontext. As a reward model, Edit-RRM outperforms Seed-1.5-VL and Seed-1.6-VL and improves consistently as it scales from 3B to 7B parameters.
What carries the argument
Edit-RRM, a verifier-based reasoning reward model that decomposes instructions into principles and scores the image against each principle separately before aggregation.
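To make the decompose-verify-aggregate mechanism concrete, here is a minimal sketch. The `vlm_generate(prompt, images) -> str` client, the prompt wording, the score-extraction regex, and the plain-mean aggregation are all illustrative assumptions, not the paper's confirmed implementation.

```python
import json
import re
from statistics import mean

def decompose(vlm_generate, instruction: str, images) -> list[dict]:
    """Ask the verifier to split the instruction into checkable principles."""
    raw = vlm_generate(
        "Break this edit instruction into evaluation points, as a JSON array "
        f"of objects with 'question' and 'category' fields: {instruction}",
        images,
    )
    return json.loads(raw)

def check(vlm_generate, principle: dict, images) -> float:
    """Run one chain-of-thought check; extract the trailing numeric score."""
    answer = vlm_generate(
        "Think step by step about the before/after images, then end with "
        f"'SCORE: <0 or 1>'. Question: {principle['question']}",
        images,
    )
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", answer)
    return float(match.group(1)) if match else 0.0

def edit_rrm_reward(vlm_generate, instruction: str, images) -> float:
    """Aggregate per-principle checks into one interpretable scalar reward."""
    principles = decompose(vlm_generate, instruction, images)
    return mean(check(vlm_generate, p, images) for p in principles)
```

Because each principle contributes one named check, the final scalar can be traced back to which requirements passed or failed, which is the interpretability claim in the pith.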
If this is right
- Editing models such as FLUX.1-kontext receive stronger training signals and produce higher-quality outputs when guided by the principle-based reward.
- The accuracy of the reward model itself increases steadily as its size grows from 3 billion to 7 billion parameters.
- Non-differentiable but fine-grained rewards can be used effectively inside group relative policy optimization for vision tasks (see the sketch after this list).
- Task-specific reasoning verifiers surpass general vision-language models when the goal is to score image edits.
- The same decomposition-and-check approach removes the need for separate reward models per editing subtype.
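The non-differentiability point deserves a concrete illustration: in GRPO the reward model is only ever called as a black box on sampled outputs, and learning flows through group-normalized advantages weighting the policy's log-probabilities. The sketch below shows only the standard group-relative advantage step; names are illustrative, and the clipped policy update is omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) scalar rewards for one prompt's sampled edits.

    No gradient flows through the reward itself, so the scorer may be any
    black box, including a CoT verifier that emits a number at the end of
    its reasoning. Advantages are normalized within the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four candidate edits of one instruction, scored 0-10 by the verifier.
adv = group_relative_advantages(torch.tensor([7.0, 4.0, 8.5, 5.5]))
```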
Where Pith is reading between the lines
- The principle-decomposition technique could be adapted to video or 3D editing where instructions also contain multiple distinct requirements.
- If the checks remain reliable on instructions far from the training distribution, the method might lower the volume of new human preference data needed for each new editing domain.
- Automating the initial breakdown of instructions into principles would further reduce manual engineering in reward model design.
Load-bearing premise
Breaking instructions into principles and aggregating CoT checks produces unbiased, generalizable rewards across all editing tasks without introducing new failure modes or requiring task-specific tuning that was not captured in the human preference data.
What would settle it
A held-out collection of editing instructions where Edit-RRM assigns higher rewards to images that human raters prefer less, or where editing models trained with the reward show no improvement or worse results on those instructions.
read the original abstract
While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
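The abstract names GCPO without giving its formulation. One hedged reading of "leverages human pairwise preference data to reinforce our pointwise RRM": sample several CoT scoring trajectories per preference pair, reward a trajectory when its pointwise scores rank the human-preferred image higher, then apply a group-relative update to the RRM. This is a guess at the mechanism, not the paper's definition; `score_with_cot` is a hypothetical stochastic sampler around the verifier.

```python
def gcpo_trajectory_rewards(score_with_cot, instruction, img_preferred,
                            img_rejected, group_size: int = 8) -> list[float]:
    """Binary agreement rewards for a group of sampled scoring trajectories."""
    rewards = []
    for _ in range(group_size):
        s_pref = score_with_cot(instruction, img_preferred)  # pointwise score
        s_rej = score_with_cot(instruction, img_rejected)
        rewards.append(1.0 if s_pref > s_rej else 0.0)  # agrees with humans?
    return rewards  # feed into a group-relative policy update on the RRM
```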
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Edit-R1, a framework for applying verifier-based reinforcement learning to image editing. It proposes Edit-RRM, a chain-of-thought verifier reward model that decomposes user editing instructions into distinct principles, performs per-principle verification on the edited image, and aggregates the results into a scalar reward. The RRM is trained first via supervised fine-tuning on synthetic CoT trajectories and then via the introduced Group Contrastive Preference Optimization (GCPO) on human pairwise preferences. The resulting reward model is used with Group Relative Policy Optimization (GRPO) to fine-tune editing models such as FLUX.1-kontext. The authors claim that Edit-RRM outperforms general-purpose VLMs (Seed-1.5-VL, Seed-1.6-VL) as an editing-specific reward model, exhibits consistent scaling from 3B to 7B parameters, and yields measurable gains when used to train downstream editors.
Significance. If the empirical claims are substantiated with rigorous metrics, the work would be significant for RLHF applications in vision. It directly targets the lack of task-specific, interpretable reward models for image editing by replacing holistic scoring with principle decomposition and CoT verification. GCPO for pointwise RRM training and the use of GRPO with a non-differentiable reward are the paper's methodological contributions. A demonstrated scaling trend and concrete gains on FLUX would indicate practical utility. However, the absence of any quantitative results, baseline specifications, or evaluation protocols in the abstract prevents assessment of whether these contributions are realized at a level that advances the field beyond existing VLM-based reward approaches.
major comments (3)
- [Abstract] The central claim that Edit-RRM 'surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model' and that 'Edit-R1 delivers gains to editing models like FLUX.1-kontext' is load-bearing yet unsupported by any quantitative metrics, win rates, human preference scores, or statistical tests. Without these numbers, baseline details, or description of the evaluation protocol (including how human preference data were collected and filtered), the superiority and scaling assertions cannot be evaluated.
- [Method (principle decomposition and aggregation)] The weakest assumption—that decomposing instructions into principles and aggregating CoT checks yields unbiased, generalizable rewards—requires explicit validation. The manuscript should demonstrate (e.g., via ablation or coverage analysis) that the principle set is exhaustive for common failure modes such as subtle color shifts, text preservation, and multi-object spatial relations, and that the aggregation rule is invariant to prompt phrasing. No such analysis is referenced in the abstract, leaving open the possibility that GCPO simply fits biases present in the (unspecified) preference dataset.
- [Training procedure] The training pipeline introduces GCPO and GRPO without providing the equations, loss formulations, or hyper-parameter details needed to reproduce or verify the claimed improvements. If these algorithms are central to the scaling trend and FLUX gains, their definitions and any ablation isolating their contribution must appear in the main text with corresponding tables.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., preference accuracy or editing metric delta) to allow readers to gauge effect size immediately.
- [Introduction] Notation for the new entities (Edit-RRM, GCPO, GRPO) should be introduced with explicit definitions on first use and consistently used thereafter.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the full manuscript and indicate the revisions we will make to strengthen the presentation of results, methods, and evaluation details.
read point-by-point responses
- Referee: [Abstract] The central claim that Edit-RRM 'surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model' and that 'Edit-R1 delivers gains to editing models like FLUX.1-kontext' is load-bearing yet unsupported by any quantitative metrics, win rates, human preference scores, or statistical tests. Without these numbers, baseline details, or description of the evaluation protocol (including how human preference data were collected and filtered), the superiority and scaling assertions cannot be evaluated.
  Authors: We agree that the abstract would be strengthened by including quantitative highlights and protocol details to make the claims self-contained. The full manuscript (Section 4) reports the relevant metrics, including human preference win rates, scaling trends from 3B to 7B parameters, and baseline comparisons against Seed-1.5-VL and Seed-1.6-VL. The evaluation protocol, including collection and filtering of human pairwise preferences, is described in Section 4.1. We will revise the abstract to summarize key quantitative results and the evaluation setup. revision: yes
- Referee: [Method (principle decomposition and aggregation)] The weakest assumption—that decomposing instructions into principles and aggregating CoT checks yields unbiased, generalizable rewards—requires explicit validation. The manuscript should demonstrate (e.g., via ablation or coverage analysis) that the principle set is exhaustive for common failure modes such as subtle color shifts, text preservation, and multi-object spatial relations, and that the aggregation rule is invariant to prompt phrasing. No such analysis is referenced in the abstract, leaving open the possibility that GCPO simply fits biases present in the (unspecified) preference dataset.
  Authors: The manuscript includes an ablation study on principle decomposition (Section 3.2 and Table 3) and coverage analysis (Appendix D) addressing common failure modes including color shifts, text preservation, and spatial relations. Sensitivity analysis to prompt phrasing is also provided. We will add explicit references to these analyses in the abstract and expand the main-text discussion of potential dataset biases and how GCPO mitigates them. revision: yes
- Referee: [Training procedure] The training pipeline introduces GCPO and GRPO without providing the equations, loss formulations, or hyper-parameter details needed to reproduce or verify the claimed improvements. If these algorithms are central to the scaling trend and FLUX gains, their definitions and any ablation isolating their contribution must appear in the main text with corresponding tables.
  Authors: The equations and loss formulations for GCPO and GRPO appear in the main text (Section 3.3–3.4, Equations 5 and 7), with hyper-parameter details and an ablation isolating GCPO's contribution in Appendix A and Table 5. We will add a concise summary of the algorithms and the ablation table directly into the main text for improved visibility and reproducibility. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper presents an empirical framework: Edit-RRM is constructed via SFT on synthetic CoT trajectories followed by GCPO on human pairwise preference data, then used as a non-differentiable reward inside GRPO to fine-tune editing models such as FLUX.1-kontext. All headline claims (surpassing Seed-1.5-VL / Seed-1.6-VL, scaling from 3B to 7B, downstream gains) are framed as direct experimental comparisons against external models and benchmarks. No equations, self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the abstract or described derivation chain; the central results remain independent of the training inputs by construction and rest on externally collected human preferences and held-out evaluations.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Human pairwise preference data can be used to train a pointwise CoT verifier that generalizes across editing instructions
- domain assumption: Aggregating per-principle CoT checks yields less biased rewards than overall scoring
invented entities (3)
- Edit-RRM: no independent evidence
- GCPO: no independent evidence
- GRPO: no independent evidence
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
- [3] Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: A highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024.
- [4] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia Conference Papers, 2024.
- [5] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, 2025.
- [6] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. OpenAI Technical Report, 2023.
- [7] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [8] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [9] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
- [10] Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T. Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. In International Conference on Machine Learning, pages 4055–4075. PMLR, 2023.
- [11] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
- [12] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In European Conference on Computer Vision, 2024.
- [13] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.
- [14] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- [15] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [16] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025.
- [17] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.
- [18] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native Chinese-English bilingual image generation foundation model. arXiv preprint arXiv:2503.07703, 2025.
- [19] Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, and Xinglong Wu. OneReward: Unified mask-guided image generation via multi-task human preference learning. arXiv preprint arXiv:2508.21066, 2025.
- [20] Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025.
- [21] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [22] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025.
- [23] Hanzhong Guo, Shen Nie, Chao Du, Tianyu Pang, Hao Sun, and Chongxuan Li. Real-time identity defenses against malicious personalization of diffusion models. arXiv preprint arXiv:2412.09844, 2024.
- [24] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2025.
- [25] Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15733–15744, 2025.
- [26] Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, et al. Think-RM: Enabling long-horizon reasoning in generative reward models. arXiv preprint arXiv:2505.16265, 2025.
- [27] Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, and Guanbin Li. DreamFuse: Adaptive image fusion with diffusion transformer. arXiv preprint arXiv:2504.08291, 2025.
- [28] Imagen 3 Team. Imagen 3. arXiv preprint arXiv:2408.07009, 2024.
- [29] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [30] Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12268–12290, 2024.
- [31] Black Forest Labs. Flux: Official inference repository for FLUX.1 models, 2024. URL https://github.com/black-forest-labs/flux. Accessed: 2024-11-12.
- [32] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748, 2024.
- [33] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.
- [34] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.
- [35] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025.
- [36] Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025.
- [37] Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, et al. EditScore: Unlocking online RL for image editing via high-fidelity reward modeling. arXiv preprint arXiv:2509.23909, 2025.
- [38] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-Video-T2V technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025.
- [39] Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025.
- [40] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- [41] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
- [42] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [43] Yuxi Ren, Jie Wu, Yanzuo Lu, Huafeng Kuang, Xin Xia, Xionghui Wang, Qianqian Wang, Yixing Zhu, Pan Xie, Shiyin Wang, et al. ByteEdit: Boost, comply and accelerate generative image editing. In European Conference on Computer Vision, pages 184–200. Springer, 2024.
- [44] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [45] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
- [46] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
- [47] Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7B: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685, 2025.
- [48] Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 Pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025.
- [49] Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026.
- [50] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
- [51] Yichun Shi, Peng Wang, and Weilin Huang. SeedEdit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024.
- [52] The Movie Gen Team. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
- [53] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.
- [54] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [55] Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, et al. WorldPM: Scaling human preference modeling. arXiv preprint arXiv:2505.10527, 2025.
- [56] Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. SeedEdit 3.0: Fast and high-quality generative image editing. arXiv preprint arXiv:2506.05083, 2025.
- [57] Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025.
- [58] Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025.
- [59] Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork UniPic 2.0: Building Kontext model with online RL for unified multimodal model. arXiv preprint arXiv:2509.04548, 2025.
- [60] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
- [61] Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. RewardDance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025.
- [62] Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv preprint arXiv:2505.14460, 2025.
- [63] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025.
- [64] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.
- [65] Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, and Yonghui Wu. A unified pairwise framework for RLHF: Bridging generative reward modeling and policy optimization. arXiv preprint arXiv:2504.04950, 2025.
- [66] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025.
- [67] Zhicheng Yan, Hao Zhang, Baoyuan Wang, Sylvain Paris, and Yizhou Yu. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics, 35(2), 2016.
- [68] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, 2025.
- [69] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- [70] Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. CogView3: Finer and faster text-to-image generation via relay diffusion. In European Conference on Computer Vision, 2024.
- [71] Feida Zhu, Zhicheng Yan, Jiajun Bu, and Yizhou Yu. Exemplar-based image and video stylization using fully convolutional semantic features. IEEE Transactions on Image Processing, 26(7):3542–3555, 2017.
Appendix prompt excerpts (recovered from garbled entries 72–80)
Entries 72–80 of the extracted list are not citations but fragments of the paper's Appendix A system prompts. The recoverable content: the decomposition prompt, used in an in-context learning manner with expert-written decompositions, asks the verifier to generate 3–4 evaluation points for "Instruction Following" (to assess the implementation of the edit), 3–4 points for "Feature Preservation" (to assess the retention of original features), and 2–3 points for "Image Quality" (to assess the quality of the resulting image), output as a JSON array whose elements contain a 'question' field and a 'category' field. The verification prompt scores each evaluation point from a comparison of the before/after images under the rule definition, outputs bounding-box coordinates in the thought process when positional changes are involved, confirms that an edit has actually occurred, and then reports the per-point scores, their average, and a reasoned final score from 0 to 10, where 0 means completely unusable (e.g., severe artifacts, very difficult to fix manually), 5 means partially usable, and 8 means nearly usable. Example evaluation points: "Does the generated image change the garage style from modern to Chinese style?: 1" and "Does the generated image contain two sports cars, one white and one black?: 1".
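A worked toy example of the recovered output format, with illustrative values only: evaluation points as a JSON array of question/category objects, per-point scores, their average, and a final 0–10 usability score.

```python
import json

# Evaluation points in the appendix's {'question', 'category'} format.
points = json.loads("""[
  {"question": "Does the generated image change the garage style from modern to Chinese style?",
   "category": "Instruction Following"},
  {"question": "Does the generated image contain two sports cars, one white and one black?",
   "category": "Feature Preservation"}
]""")

point_scores = [1, 1]  # one 0/1 check per evaluation point
average_score = sum(point_scores) / len(point_scores)
final_score = 8  # 0 = completely unusable, 5 = partially usable, 8 = nearly usable
print(average_score, final_score)  # -> 1.0 8
```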