Leveraging Verifier-Based Reinforcement Learning in Image Editing

Hanzhong Guo; Jie Liu; Jie Wu; Linxiao Yuan; Weilin Huang; Xionghui Wang; Yizhou Yu; Yu Gao; Zilyu Ye

arxiv: 2604.27505 · v2 · pith:OGHK4LF5new · submitted 2026-04-30 · 💻 cs.CV

Pith reviewed 2026-05-21 09:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords image editing · reinforcement learning · reward model · chain-of-thought · verifier · RLHF · diffusion models · preference optimization

The pith

A chain-of-thought verifier reward model improves reinforcement learning for image editing by scoring each instruction principle separately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the lack of reliable reward models that hold back reinforcement learning for image editing. Current edit reward models issue single overall scores that overlook the varied requirements inside a given instruction and introduce bias. The proposed solution replaces the simple scorer with a reasoning verifier that splits each editing instruction into distinct principles, checks the output image against every principle, and combines the individual results into a fine-grained reward. This verifier is first warmed up with supervised fine-tuning to produce chain-of-thought trajectories and then strengthened with group contrastive preference optimization on human preference data. The resulting reward model is used inside a GRPO training loop to improve downstream editing models such as FLUX.1-kontext.

Core claim

We introduce Edit-RRM, a chain-of-thought verifier-based reasoning reward model that breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. Built first with supervised fine-tuning for CoT trajectories and then refined by Group Contrastive Preference Optimization, the Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model and shows consistent gains when model size scales from 3B to 7B parameters. When the same reward model is used inside Edit-R1 with GRPO, editing models such as FLUX.1-kontext improve.

What carries the argument

Edit-RRM, the reasoning reward model that decomposes an editing instruction into separate principles, verifies the image against each principle individually, and aggregates the per-principle checks into a single fine-grained reward signal.
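
The decompose, verify, aggregate pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PrincipleCheck` type, the binary pass/fail verdicts, and the "fraction of principles satisfied" aggregation rule are hypothetical choices made for the example, and the example principles are invented. The paper's summary does not specify how Edit-RRM scores or weights individual principles.

```python
from dataclasses import dataclass


@dataclass
class PrincipleCheck:
    """One verifier judgment for a single principle (hypothetical structure)."""
    principle: str   # requirement derived from the editing instruction
    passed: bool     # assumed binary verdict for this principle
    rationale: str   # short reasoning note kept so the reward can be inspected


def aggregate_reward(checks: list[PrincipleCheck]) -> float:
    """Assumed aggregation rule: fraction of principles the edit satisfies."""
    if not checks:
        raise ValueError("at least one principle is required")
    return sum(c.passed for c in checks) / len(checks)


# Invented example: four principles for "replace the background with a beach".
checks = [
    PrincipleCheck("background replaced with a beach", True, "sand and sea visible"),
    PrincipleCheck("subject identity preserved", True, "face and clothing unchanged"),
    PrincipleCheck("no new objects introduced", False, "an extra umbrella appears"),
    PrincipleCheck("lighting matches the new background", True, "shadows are consistent"),
]

reward = aggregate_reward(checks)                      # 3 of 4 principles pass
failed = [c.principle for c in checks if not c.passed]  # what to inspect or debug
```

Keeping the per-principle verdicts alongside the scalar is what makes the signal inspectable: a low reward can be traced back to the specific principle that failed, which a single overall score cannot offer.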

If this is right

The RRM supplies interpretable, principle-level feedback that can be inspected and debugged during training.
Reward quality scales upward with model size from 3B to 7B parameters.
Editing models trained with the RRM via GRPO obtain measurable gains over the same models trained without it.
The verifier approach can be applied to other editing architectures beyond FLUX.1-kontext.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same principle-by-principle verification pattern could be tested on text-to-image generation or video editing tasks where instructions are equally multi-faceted.
If the decomposition of instructions into principles can be automated reliably, the method could handle longer and more complex editing requests without extra human annotation.
Because the reward model is non-differentiable, future work might explore hybrid training that mixes the verifier reward with differentiable auxiliary losses.

Load-bearing premise

Splitting editing instructions into distinct principles and scoring the image against each one separately produces unbiased and generalizable rewards that avoid the biases of single overall scores.

What would settle it

A head-to-head evaluation in which Edit-RRM scores lower than Seed-1.5-VL or Seed-1.6-VL on standard editing benchmarks, or in which GRPO-trained editing models show no improvement, would falsify the central claims.

Original abstract

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
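
The abstract notes that GRPO can train an editor with a non-differentiable reward. A minimal sketch of why that works: GRPO only needs a scalar reward per sampled output, which it converts into a group-relative advantage by normalizing within the group of samples drawn for the same prompt. The function name, the use of the population standard deviation, the `eps` stabilizer, and the reward values below are assumptions made for illustration, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Only scalar
    rewards are consumed, so the reward model is never differentiated.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifier rewards for four candidate edits of the same input.
rewards = [0.25, 0.75, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
best_index = max(range(len(advantages)), key=advantages.__getitem__)
```

Edits scoring above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. Because the signal is relative, a fine-grained reward that separates near-identical candidates matters more than its absolute calibration.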

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a principle-based CoT verifier for image editing rewards and claims it beats general VLMs while scaling, but the gains may trace more to editing-specific fine-tuning than the decomposition itself.

The main thing to know is that this work builds a reward model for image editing by splitting instructions into separate principles, running chain-of-thought checks on each one against the edited image, and aggregating the results into a more interpretable score. They then use that model inside RL to improve editing outputs on models like FLUX.1-kontext. The new piece is the Group Contrastive Preference Optimization step that turns human pairwise preferences into training signal for their pointwise reasoning reward model after an initial SFT cold start. That combination, plus the downstream GRPO training of the editor, is the concrete method they contribute.

It is useful that they move away from blunt overall scores and try to make the reward checks more granular and tied to the instruction parts. Reporting a scaling trend from 3B to 7B parameters and some measured improvement on FLUX is also straightforward evidence that the approach is at least workable in practice.

The soft spot is exactly the one the stress-test note flags. Without an ablation that holds the training data, model size, and optimization fixed while swapping only the reward structure (principles versus single overall score or plain CoT), it is hard to tell whether the verifier design itself drives the reported outperformance over Seed VLMs or whether the gains come mostly from domain adaptation on editing trajectories. The abstract gives no dataset sizes, preference counts, or error bars, so the strength of the central claim stays difficult to judge from the summary alone.

This paper is for researchers working on reward modeling and RL for conditional image generation. Someone already thinking about fine-grained rewards or verifier-style models in vision would find the architecture and the GCPO trick worth examining, even if they end up running their own controls.

The work shows honest engagement with the RLHF literature and a clear problem statement, so it has enough substance for a serious referee to spend time on. I would send it to peer review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Edit-R1, a framework for reinforcement learning in image editing. It proposes Edit-RRM, a chain-of-thought verifier-based reasoning reward model that decomposes editing instructions into distinct principles, evaluates the edited image against each principle, and aggregates the per-principle checks into a fine-grained, interpretable reward. The RRM is trained via supervised fine-tuning on CoT trajectories followed by Group Contrastive Preference Optimization (GCPO) on human pairwise preferences; the resulting reward model is then used with GRPO to optimize editing models such as FLUX.1-kontext. The paper claims that Edit-RRM outperforms general VLMs including Seed-1.5-VL and Seed-1.6-VL, exhibits consistent scaling from 3B to 7B parameters, and delivers measurable gains to downstream editing models.

Significance. If the reported gains are shown to arise specifically from the principle-decomposition verifier rather than domain adaptation alone, the work would offer a concrete advance in reward modeling for image editing by supplying more interpretable and potentially less biased signals than overall-score or general-purpose VLMs. The scaling trend, if reproducible, would further support verifier-based RL approaches in vision tasks.

major comments (2)

[Abstract] Abstract: The central claim that breaking instructions into distinct principles and performing per-principle evaluation produces meaningfully less biased rewards than overall scoring or general VLMs is load-bearing for the contribution. However, no ablation is described that holds training data, model size, and optimization (SFT+GCPO) fixed while varying only the reward aggregation (principles vs. single overall score or direct CoT). Without this control, the reported superiority over Seed-1.5-VL and Seed-1.6-VL and the scaling trend cannot be attributed to the verifier structure.
[Experiments] Experiments (implied by abstract claims): The soundness of the outperformance and FLUX.1-kontext gains rests on experimental outcomes, yet the abstract provides no dataset descriptions, evaluation metrics, error bars, or baseline details. This absence prevents verification that the gains are robust and directly traceable to the RRM rather than other factors in the training pipeline.

minor comments (1)

[Abstract] Abstract: The distinction between the overall framework (Edit-R1) and the reward model (Edit-RRM) could be stated more explicitly when summarizing the contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline planned revisions to strengthen the attribution of results to the verifier structure and to improve clarity of experimental details.

Referee: [Abstract] Abstract: The central claim that breaking instructions into distinct principles and performing per-principle evaluation produces meaningfully less biased rewards than overall scoring or general VLMs is load-bearing for the contribution. However, no ablation is described that holds training data, model size, and optimization (SFT+GCPO) fixed while varying only the reward aggregation (principles vs. single overall score or direct CoT). Without this control, the reported superiority over Seed-1.5-VL and Seed-1.6-VL and the scaling trend cannot be attributed to the verifier structure.

Authors: We agree that an explicit ablation isolating the principle-decomposition mechanism would provide stronger causal evidence. Our current comparisons show Edit-RRM outperforming general-purpose VLMs (Seed-1.5-VL, Seed-1.6-VL) that use overall scoring, and the RRM is trained on editing-specific CoT trajectories. However, to directly address the concern that gains may stem from domain adaptation rather than the verifier design, we will add a controlled ablation in the revised manuscript: a variant RRM trained with identical data, model sizes (3B/7B), and SFT+GCPO procedure but using a single overall score instead of per-principle evaluation. Results of this ablation will be reported alongside the existing scaling and downstream editing experiments.

revision: yes

Referee: [Experiments] Experiments (implied by abstract claims): The soundness of the outperformance and FLUX.1-kontext gains rests on experimental outcomes, yet the abstract provides no dataset descriptions, evaluation metrics, error bars, or baseline details. This absence prevents verification that the gains are robust and directly traceable to the RRM rather than other factors in the training pipeline.

Authors: The abstract is intentionally concise to comply with length limits and therefore omits granular experimental details. The full manuscript (Section 4) describes the training datasets for SFT and GCPO (editing instruction pairs with human preferences), evaluation protocols (human pairwise preference accuracy for the RRM plus downstream metrics on editing quality), baselines (including the cited Seed VLMs and additional controls), and results with error bars or standard deviations across runs. The FLUX.1-kontext improvements are measured on standard image-editing benchmarks using GRPO with the trained RRM. To enhance readability, we will expand the abstract with a brief mention of the primary datasets, metrics, and evaluation setting, or add a short experimental summary paragraph.

revision: partial
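
The rebuttal names human pairwise preference accuracy as the evaluation protocol for the RRM. As a rough sketch of that metric for a pointwise scorer: score both edits in each human-labeled pair and count the pair as correct when the human-preferred edit receives the strictly higher score. The function name, the treatment of ties as errors, and the score values are assumptions of this example; the manuscript's exact protocol is not given in the text above.

```python
def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred edit gets the higher score.

    Each pair is (score of preferred edit, score of rejected edit).
    Ties count as errors here, which is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("at least one scored pair is required")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


# Hypothetical reward-model scores for four human-labeled pairs.
scored_pairs = [(0.75, 0.5), (1.0, 0.25), (0.5, 0.5), (0.25, 0.75)]
accuracy = pairwise_accuracy(scored_pairs)  # two clear wins, one tie, one miss
```

Reporting this number for both the per-principle RRM and a single-score variant trained on the same data would be one direct way to present the ablation the referee requests.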

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains self-contained

The paper defines Edit-RRM as a CoT verifier that decomposes instructions into principles and aggregates per-principle evaluations, then trains it via SFT cold-start followed by the newly introduced GCPO on external human pairwise preference data, and finally applies the resulting non-differentiable reward inside GRPO for editing models. All performance claims (surpassing Seed VLMs, scaling from 3B to 7B, gains on FLUX) are presented as empirical experimental outcomes on downstream editing tasks rather than quantities that reduce by construction to the fitted parameters, the decomposition rule, or any self-citation chain. No load-bearing step equates a reported prediction or uniqueness result to its own inputs; the verifier structure is an independent modeling choice whose contribution is tested rather than presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that human preference data can be used to train a pointwise CoT verifier that generalizes across editing instructions; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: Breaking editing instructions into distinct principles yields unbiased per-principle evaluations that aggregate into a reliable overall reward.
Stated in the abstract as the core motivation for replacing simple scorers with a reasoning verifier.

invented entities (2)

Edit-RRM (no independent evidence)
purpose: Chain-of-thought verifier-based reasoning reward model for image editing
Newly introduced component that evaluates edits against individual principles.

GCPO (no independent evidence)
purpose: Group Contrastive Preference Optimization algorithm to train the RRM from pairwise data
New RL algorithm presented for reinforcing the pointwise reward model.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 29 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024

work page arXiv 2024
[4]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia Conference Papers, 2024

work page 2024
[5]

Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv e-prints, pages arXiv–2506, 2025

work page 2025
[6]

Improving image generation with better captions.OpenAI TechnicalReport, 2023

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.OpenAI TechnicalReport, 2023

work page 2023
[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

work page 2023
[9]

Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

work page 2024
[10]

Muse: Text-to-image generation via masked generative transformers

Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. InInternational Conference on Machine Learning, pages 4055–4075. PMLR, 2023

work page 2023
[11]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, 2024

work page 2024
[13]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-firstinternational conference on machine learning, 2024

work page 2024
[16]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Onereward: Unified mask-guided image generation via multi-task human preference learning.arXiv:2508.21066, 2025

Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, and Xinglong Wu. Onereward: Unified mask-guided image generation via multi-task human preference learning.arXiv preprint arXiv:2508.21066, 2025

work page arXiv 2025
[20]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Real-time identity defenses against malicious personalization of diffusion models.arXiv preprint arXiv:2412.09844, 2024

Hanzhong Guo, Shen Nie, Chao Du, Tianyu Pang, Hao Sun, and Chongxuan Li. Real-time identity defenses against malicious personalization of diffusion models.arXiv preprint arXiv:2412.09844, 2024

work page arXiv 2024
[24]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15733–15744, 2025

work page 2025
[26]

arXiv preprint arXiv:2505.16265

Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, et al. Think-rm: Enabling long-horizon reasoning in generative reward models.arXiv preprint arXiv:2505.16265, 2025

work page arXiv 2025
[27]

Dreamfuse: Adaptive image fusion with diffusion transformer.arXiv preprint arXiv:2504.08291, 2025

Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, and Guanbin Li. Dreamfuse: Adaptive image fusion with diffusion transformer.arXiv preprint arXiv:2504.08291, 2025

work page arXiv 2025
[28]

Imagen 3.arXiv preprint arXiv:2408.07009, 2024

Imagen 3 Team. Imagen 3.arXiv preprint arXiv:2408.07009, 2024

work page arXiv 2024
[29]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Viescore: Towards explainable metrics for conditional image synthesis evaluation

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 12268–12290, 2024

work page 2024
[31]

Flux: Official inference repository for flux.1 models, 2024

Black Forest Labs. Flux: Official inference repository for flux.1 models, 2024. URL https://github.com/ black-forest-labs/flux. Accessed: 2024-11-12

work page 2024
[32]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding.arXiv preprint arXiv:2405.08748, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprintarXiv:2504.17761, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Inference-time scaling for generalist reward modeling,

Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling.arXiv preprint arXiv:2504.02495, 2025. 12

work page arXiv 2025
[37]

Editscore: Unlocking online rl for image editing via high-fidelity reward modeling

Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, et al. Editscore: Unlocking online rl for image editing via high-fidelity reward modeling.arXivpreprintarXiv:2509.23909, 2025

work page arXiv 2025
[38]

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model.arXiv preprint arXiv:2502.10248, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Hpsv3: Towards wide-spectrum human preference score.arXiv preprint arXiv:2508.03789,

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025

work page arXiv 2025
[40]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

[42]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

[43]

Byteedit: Boost, comply and accelerate generative image editing

Yuxi Ren, Jie Wu, Yanzuo Lu, Huafeng Kuang, Xin Xia, Xionghui Wang, Qianqian Wang, Yixing Zhu, Pan Xie, Shiyin Wang, et al. Byteedit: Boost, comply and accelerate generative image editing. In European Conference on Computer Vision, pages 184–200. Springer, 2024

[44]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

[45]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023

[46]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022

[47]

Seaweed-7b: Cost-effective training of video generation foundation model

Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685, 2025

[48]

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025

[49]

Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

[50]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

[51]

Seededit: Align image re-generation to image editing

Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024

[52]

Movie Gen: A Cast of Media Foundation Models

The Movie Gen Team. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

[53]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

[54]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[55]

Worldpm: Scaling human preference modeling

Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, et al. Worldpm: Scaling human preference modeling. arXiv preprint arXiv:2505.10527, 2025

[56]

Seededit 3.0: Fast and high-quality generative image editing

Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing. arXiv preprint arXiv:2506.05083, 2025

[57]

Unified multimodal chain-of-thought reward model through reinforcement fine-tuning

Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025

[58]

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025

[59]

Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model

Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model. arXiv preprint arXiv:2509.04548, 2025

[60]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

[61]

Rewarddance: Reward scaling in visual generation

Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. Rewarddance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025

[62]

Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank

Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv preprint arXiv:2505.14460, 2025

[63]

Omnigen: Unified image generation

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025

[64]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

[65]

A unified pairwise framework for rlhf: Bridging generative reward modeling and policy optimization

Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, and Yonghui Wu. A unified pairwise framework for rlhf: Bridging generative reward modeling and policy optimization. arXiv preprint arXiv:2504.04950, 2025

[66]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

[67]

Automatic photo adjustment using deep neural networks

Zhicheng Yan, Hao Zhang, Baoyuan Wang, Sylvain Paris, and Yizhou Yu. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics, 35(2), 2016

[68]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, 2025

[69]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

[70]

Cogview3: Finer and faster text-to-image generation via relay diffusion

Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion. In European Conference on Computer Vision, 2024

[71]

Exemplar-based image and video stylization using fully convolutional semantic features

Feida Zhu, Zhicheng Yan, Jiajun Bu, and Yizhou Yu. Exemplar-based image and video stylization using fully convolutional semantic features. IEEE Transactions on Image Processing, 26(7):3542–3555, 2017

A System prompt

A.1 System Prompt for Decomposing Principles

In practice, the prompt is used in an in-context learning manner with expert-written decomposi...

3-4 points for "Instruction Following" (to assess the implementation of the edit)

3-4 points for "Feature Preservation" (to assess the retention of original features)

2-3 points for "Image Quality" (to assess the quality of the resulting image).

### Output Format:
A JSON array, where each element contains a 'question' field and a 'category' field.

### New Task:
Instruction: {Edit Instruction}
Image: <image>
Please generate all evaluation points ...

For each evaluation point provided in the format ‘[{'question': , 'category': }]‘, evaluate and score it based on a comparison of the before/after images and the edit instruction, strictly adhering to the scoring standards in the [Rule Definition]

Based on the above, assign a final score to the edited image from 0 to 10. 0 means completely unusable (e.g., severe artifacts, very difficult to fix manually). 5 means partially usable (some good aspects but far from ready). 8 means nearly usable (minor artifacts, inconsistencies, or instruction d ...

When positional changes are involved, output bounding box coordinates in your thought process to reflect your analysis of the position, and then judge if the edit is valid based on the scale of change defined in the rules

Finally, assess the difference between the before and after images to confirm that an edit has actually occurred.

## Output Format:
Produce the output in the following sequence: scores for each evaluation point, the average score of the evaluation points, and finally, the reasoned final score for t...

Does the generated image change the garage style from modern to Chinese style?: 1

Does the generated image contain two sports cars, one white and one black?: 1


[41] [41]

Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

work page 2023

[42] [42]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[43] [43]

Byteedit: Boost, comply and accelerate generative image editing

Yuxi Ren, Jie Wu, Yanzuo Lu, Huafeng Kuang, Xin Xia, Xionghui Wang, Qianqian Wang, Yixing Zhu, Pan Xie, Shiyin Wang, et al. Byteedit: Boost, comply and accelerate generative image editing. InEuropean Conference on Computer Vision, pages 184–200. Springer, 2024

work page 2024

[44] [44]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022

[45] [45]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

work page 2023

[46] [46]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. InAdvancesin Neural Information Processing Systems, 2022

work page 2022

[47] [47]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al

Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model.arXiv preprint arXiv:2504.08685, 2025

work page arXiv 2025

[48] [48]

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model.arXiv preprint arXiv:2512.13507, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[49] [49]

Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[50] [50]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

work page 2024

[51] [51]

Seededit: Align image re-generation to image editing

Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing.arXiv preprint arXiv:2411.06686, 2024

work page arXiv 2024

[52] [52]

Movie Gen: A Cast of Media Foundation Models

The Movie Gen Team. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [53]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

work page 2024

[54] [54]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

Worldpm: Scaling human preference modeling.arXiv preprint arXiv:2505.10527, 2025

Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, et al. Worldpm: Scaling human preference modeling.arXiv preprint arXiv:2505.10527, 2025

work page arXiv 2025

[56] [56]

Seededit 3.0: Fast and high-quality generative image editing.arXiv preprint arXiv:2506.05083, 2025

Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing.arXiv preprint arXiv:2506.05083, 2025

work page arXiv 2025

[57] [57]

Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv:2505.03318, 2025

Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025

work page arXiv 2025

[58] [58]

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model

Hongyang Wei, Baixin Xu, Hongbo Liu, Cyrus Wu, Jie Liu, Yi Peng, Peiyu Wang, Zexiang Liu, Jingwen He, Yidan Xietian, et al. Skywork unipic 2.0: Building kontext model with online rl for unified multimodal model. arXiv preprint arXiv:2509.04548, 2025

work page arXiv 2025

[60] [60]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[61] Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. RewardDance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025

[62] Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv preprint arXiv:2505.14460, 2025

[63] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2025

[64] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

[65] Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, and Yonghui Wu. A unified pairwise framework for RLHF: Bridging generative reward modeling and policy optimization. arXiv preprint arXiv:2504.04950, 2025

[66] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025

[67] Zhicheng Yan, Hao Zhang, Baoyuan Wang, Sylvain Paris, and Yizhou Yu. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics, 35(2), 2016

[68] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, 2025

[69] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

[70] Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, and Jie Tang. CogView3: Finer and faster text-to-image generation via relay diffusion. In European Conference on Computer Vision, 2024

[71] Feida Zhu, Zhicheng Yan, Jiajun Bu, and Yizhou Yu. Exemplar-based image and video stylization using fully convolutional semantic features. IEEE Transactions on Image Processing, 26(7):3542–3555, 2017

A System prompt

A.1 System Prompt for Decomposing Principles

In practice, the prompt is used in an in-context learning manner with expert-written decomposi...

3-4 points for "Instruction Following" (to assess the implementation of the edit)

3-4 points for "Feature Preservation" (to assess the retention of original features)

2-3 points for "Image Quality" (to assess the quality of the resulting image).
### Output Format:
A JSON array, where each element contains a 'question' field and a 'category' field.
### New Task:
Instruction: {Edit Instruction}
Image: <image>
Please generate all evaluation points ...
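The output format requested by the decomposition prompt can be checked mechanically. The following is a minimal sketch, assuming the model's reply is a bare JSON array and that the per-category counts quoted above (3-4, 3-4, 2-3) are meant as hard bounds. The names `check_evaluation_points` and `CATEGORY_RANGES` are hypothetical and do not come from the paper.

```python
import json
from collections import Counter

# Assumption: the count ranges quoted in the prompt are treated as hard bounds.
# CATEGORY_RANGES and check_evaluation_points are illustrative names only.
CATEGORY_RANGES = {
    "Instruction Following": (3, 4),
    "Feature Preservation": (3, 4),
    "Image Quality": (2, 3),
}


def check_evaluation_points(raw_json: str) -> list[str]:
    """Return a list of problems; an empty list means the array looks valid."""
    points = json.loads(raw_json)
    if not isinstance(points, list):
        return ["top-level value is not a JSON array"]

    problems = []
    counts = Counter()
    for i, point in enumerate(points):
        # Each element must carry both fields named in the prompt.
        if not isinstance(point, dict) or not {"question", "category"} <= point.keys():
            problems.append(f"element {i} lacks a 'question' or 'category' field")
            continue
        category = point["category"]
        if not isinstance(category, str) or category not in CATEGORY_RANGES:
            problems.append(f"element {i} has unknown category {category!r}")
            continue
        counts[category] += 1

    # Compare the number of points per category with the quoted ranges.
    for category, (low, high) in CATEGORY_RANGES.items():
        if not low <= counts[category] <= high:
            problems.append(
                f"{category}: {counts[category]} points, expected {low}-{high}"
            )
    return problems
```

A reply that fails the check could be regenerated or discarded before scoring; whether the authors apply any such filter is not stated in this section.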

For each evaluation point provided in the format `[{'question': , 'category': }]`, evaluate and score it based on a comparison of the before/after images and the edit instruction, strictly adhering to the scoring standards in the [Rule Definition]

Based on the above, assign a final score to the edited image from 0 to 10. 0 means completely unusable (e.g., severe artifacts, very difficult to fix manually). 5 means partially usable (some good aspects but far from ready). 8 means nearly usable (minor artifacts, inconsistencies, or instruction d ...

When positional changes are involved, output bounding box coordinates in your thought process to reflect your analysis of the position, and then judge if the edit is valid based on the scale of change defined in the rules

Finally, assess the difference between the before and after images to confirm that an edit has actually occurred.
## Output Format:
Produce the output in the following sequence: scores for each evaluation point, the average score of the evaluation points, and finally, the reasoned final score for t...

Does the generated image change the garage style from modern to Chinese style?: 1

Does the generated image contain two sports cars, one white and one black?: 1
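The output sequence described in the scoring prompt (per-point scores, then their average, then a reasoned final score) can be illustrated for the averaging step alone. This is a minimal sketch, assuming each evaluation point receives a numeric score; the scale itself is set by the [Rule Definition], which is not reproduced in this section. The function name `average_score` is illustrative. The final 0 to 10 score is reasoned by the verifier and is not computed here.

```python
def average_score(point_scores: dict[str, float]) -> float:
    """Mean of the per-point scores; raises ValueError when no points are given."""
    if not point_scores:
        raise ValueError("at least one evaluation point is required")
    return sum(point_scores.values()) / len(point_scores)


# The two example evaluation points shown above, each scored 1.
scores = {
    "Does the generated image change the garage style from modern to Chinese style?": 1,
    "Does the generated image contain two sports cars, one white and one black?": 1,
}
print(average_score(scores))  # 1.0
```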