
arxiv: 2510.16888 · v3 · pith:KFV5LQB4new · submitted 2025-10-19 · 💻 cs.CV

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Pith reviewed 2026-05-21 17:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image editing · diffusion models · policy optimization · multimodal large language models · finetuning · instruction-based editing · reinforcement learning

The pith

A post-training framework for instruction-based image editing uses diffusion negative-aware finetuning and MLLM logit feedback to reach state-of-the-art benchmark scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Instruction-based image editing models trained only with supervised fine-tuning tend to overfit to specific patterns and fail to generalize to new instructions. The paper presents Edit-R1, a policy optimization approach that applies Diffusion Negative-aware Finetuning to enable likelihood-free updates aligned with flow matching and support for higher-order samplers. It further treats a multimodal large language model as a training-free reward source by extracting fine-grained signals from its output logits, stabilized through low-variance group filtering. The resulting UniWorld-V2 model records new top scores on ImgEdit and GEdit-Bench and produces clear gains when applied to multiple different base models.

Core claim

The central claim is that Diffusion Negative-aware Finetuning provides a consistent, likelihood-free policy optimization method for diffusion-based editing, while an MLLM supplies reliable implicit feedback through its logits and low-variance group filtering reduces scoring variance, together yielding a model-agnostic post-training recipe that lifts editing performance beyond supervised fine-tuning alone.

What carries the argument

Edit-R1 post-training framework built on Diffusion Negative-aware Finetuning (DiffusionNFT) for policy optimization and an MLLM used as a unified training-free reward model via output logits with low-variance group filtering.
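
As a rough illustration of what "reward via output logits" can mean, the sketch below turns the logits an MLLM assigns to a handful of candidate score tokens into one continuous reward. It assumes the reward is the probability-weighted mean of the candidate scores; the 1 to 5 scale, the function names, and the min-max normalization are hypothetical choices for this example, not details taken from the paper.

```python
import math
from typing import Sequence


def expected_score(logits: Sequence[float], score_values: Sequence[float]) -> float:
    """Softmax over the candidate score tokens, then the probability-weighted mean.

    `logits[i]` is assumed to be the MLLM logit of the token that encodes
    `score_values[i]` (hypothetical layout for this sketch).
    """
    if len(logits) != len(score_values) or not logits:
        raise ValueError("logits and score_values must be non-empty and equal in length")
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return sum((w / total) * s for w, s in zip(weights, score_values))


def normalized_reward(logits: Sequence[float], score_values: Sequence[float]) -> float:
    """Rescale the expected score to [0, 1] using the ends of the score scale."""
    low, high = min(score_values), max(score_values)
    if high == low:
        raise ValueError("score scale must contain at least two distinct values")
    return (expected_score(logits, score_values) - low) / (high - low)
```

Compared with taking only the most likely score token, an expectation of this kind changes smoothly as probability mass moves between neighbouring scores, which is one way to read the "fine-grained" feedback described above.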

If this is right

  • The same framework produces substantial gains when applied to different base models including Qwen-Image-Edit and FLUX-Kontext.
  • Training can use higher-order samplers because the optimization remains consistent with the flow matching forward process.
  • A single MLLM serves as reward model for many different editing instructions without task-specific training.
  • The approach reduces overfitting to annotated patterns and improves generalization outside the training distribution.
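
To make "higher-order sampler" concrete, here is a minimal sketch comparing a first-order Euler step with a second-order Heun step when integrating an ODE of the form dx/dt = v(x, t), which is the kind of ODE a flow-matching model is sampled from. The velocity field `toy_velocity` is a hypothetical stand-in with a known solution, not a trained editing model, and nothing here reproduces the paper's sampler configuration.

```python
import math
from typing import Callable, List

Vector = List[float]
Velocity = Callable[[Vector, float], Vector]


def euler_step(v: Velocity, x: Vector, t: float, dt: float) -> Vector:
    """First-order update: follow the velocity at the current point."""
    k1 = v(x, t)
    return [xi + dt * ki for xi, ki in zip(x, k1)]


def heun_step(v: Velocity, x: Vector, t: float, dt: float) -> Vector:
    """Second-order update: average the velocity at the start and at an Euler prediction."""
    k1 = v(x, t)
    x_pred = [xi + dt * ki for xi, ki in zip(x, k1)]
    k2 = v(x_pred, t + dt)
    return [xi + 0.5 * dt * (a + b) for xi, a, b in zip(x, k1, k2)]


def integrate(step, v: Velocity, x0: Vector, t0: float = 0.0,
              t1: float = 1.0, n_steps: int = 10) -> Vector:
    """Apply `step` n_steps times on a uniform time grid from t0 to t1."""
    x = list(x0)
    dt = (t1 - t0) / n_steps
    for i in range(n_steps):
        x = step(v, x, t0 + i * dt, dt)
    return x


def toy_velocity(x: Vector, t: float) -> Vector:
    """Hypothetical field dx/dt = -x, whose exact solution is x0 * exp(-t)."""
    return [-xi for xi in x]


exact = math.exp(-1.0)
euler_error = abs(integrate(euler_step, toy_velocity, [1.0])[0] - exact)
heun_error = abs(integrate(heun_step, toy_velocity, [1.0])[0] - exact)
```

With the same ten steps, `heun_error` comes out well below `euler_error` on this toy problem, at the cost of two velocity evaluations per step instead of one.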

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar post-training could lower the amount of human-annotated editing pairs needed to reach high performance.
  • The logit-based feedback and group filtering technique might transfer to other diffusion or flow-based generative tasks.
  • Low-variance filtering of noisy LLM signals could stabilize reinforcement learning loops in additional multimodal settings.

Load-bearing premise

The multimodal large language model supplies reliable, unbiased fine-grained feedback on editing quality through its output logits across varied instructions.

What would settle it

Applying the full Edit-R1 procedure to a base model such as FLUX-Kontext and measuring no improvement or a drop relative to standard supervised fine-tuning on the ImgEdit benchmark would falsify the claimed gains.

Original abstract

Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Edit-R1, a post-training framework for instruction-based image editing. It proposes Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method aligned with the flow matching forward process, and uses a Multimodal Large Language Model (MLLM) as a training-free reward model via its output logits, augmented by a low-variance group filtering mechanism to reduce scoring noise. The resulting UniWorld-V2 model is reported to achieve state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench, with the framework shown to be model-agnostic and to deliver gains on base models including Qwen-Image-Edit and FLUX-Kontext. Code and models are released publicly.

Significance. If the performance claims and underlying assumptions hold after validation, the work would represent a meaningful contribution to post-training of diffusion-based image editors by enabling exploration beyond supervised fine-tuning distributions through policy optimization. The model-agnostic design and public code release are strengths that support broader applicability and reproducibility in the computer vision community.

major comments (3)
  1. [§4 (Experiments)] The SOTA claims rest on benchmark scores of 4.49 and 7.83 without reported error bars, multiple random seeds, or statistical significance tests against baselines; this makes it impossible to determine whether the gains from DiffusionNFT and MLLM-driven optimization are robust or could be explained by variance in evaluation.
  2. [§3.2 (MLLM Reward Model)] The central assumption that MLLM output logits provide fine-grained, unbiased feedback correlating with editing success across diverse instructions lacks supporting validation such as correlation with human judgments or ablation on logit calibration; without this, the reward signal's reliability for policy optimization remains unverified and could bias the training trajectory.
  3. [§3.3 (Low-variance Group Filtering)] The filtering mechanism is claimed to reduce noise while preserving the optimization trajectory, yet no analysis is provided on whether it systematically excludes higher-variance (potentially harder or more diverse) edits, which would alter the effective data distribution and risk inflating benchmark scores without true generalization improvement.
minor comments (2)
  1. [Abstract] The abstract states 'substantial performance gains' on base models but provides no quantitative deltas; adding these numbers would improve precision.
  2. [§3.1] Notation for the DiffusionNFT objective could benefit from an explicit equation reference when first introduced to aid readers unfamiliar with flow-matching consistency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor and validation that we will address in the revision. Below we respond point by point to the major comments.

  1. Referee: [§4 (Experiments)] The SOTA claims rest on benchmark scores of 4.49 and 7.83 without reported error bars, multiple random seeds, or statistical significance tests against baselines; this makes it impossible to determine whether the gains from DiffusionNFT and MLLM-driven optimization are robust or could be explained by variance in evaluation.

    Authors: We agree that reporting variability and statistical significance would strengthen the claims. The reported scores reflect the best single-run results obtained during development, but we have since performed additional training runs with three different random seeds for the key configurations. In the revised manuscript we will report mean and standard deviation for the main benchmarks and include paired statistical tests against the strongest baselines. revision: yes

  2. Referee: [§3.2 (MLLM Reward Model)] The central assumption that MLLM output logits provide fine-grained, unbiased feedback correlating with editing success across diverse instructions lacks supporting validation such as correlation with human judgments or ablation on logit calibration; without this, the reward signal's reliability for policy optimization remains unverified and could bias the training trajectory.

    Authors: The use of MLLM logits is motivated by their ability to provide instruction-aware, continuous signals without additional training. While the original submission did not contain an explicit human correlation study, we will add a targeted validation: we sample a held-out set of 200 edits, collect human preference ratings, and report Spearman correlation between MLLM logit scores and human judgments. We will also include an ablation comparing raw logits versus calibrated or layer-specific variants. revision: yes

  3. Referee: [§3.3 (Low-variance Group Filtering)] The filtering mechanism is claimed to reduce noise while preserving the optimization trajectory, yet no analysis is provided on whether it systematically excludes higher-variance (potentially harder or more diverse) edits, which would alter the effective data distribution and risk inflating benchmark scores without true generalization improvement.

    Authors: The group filtering selects batches with low intra-group score variance to stabilize the policy gradient estimate. To examine possible distributional shift, we will add an analysis in the revision that compares instruction complexity, image diversity metrics, and edit difficulty proxies before and after filtering. Any observed bias will be quantified and discussed, together with an ablation that relaxes the variance threshold. revision: yes
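
The human-correlation check proposed in response 2 could be computed along the following lines. This is a minimal sketch of Spearman rank correlation in plain Python (Pearson correlation applied to average ranks, so ties are handled); the function names are introduced here for illustration, and no scores or ratings from the paper are used.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Called as `spearman(mllm_scores, human_ratings)` on the held-out edits, a value near 1 would indicate that the MLLM orders edits much as human raters do, while a value near 0 would indicate little agreement in ordering.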

Circularity Check

0 steps flagged

No circularity: empirical SOTA claims rest on external benchmarks


The paper proposes an empirical post-training framework (DiffusionNFT policy optimization with MLLM logit rewards and low-variance filtering) and reports direct performance numbers on independent external benchmarks (ImgEdit 4.49, GEdit-Bench 7.83). No equation or derivation reduces these scores to a fitted parameter, self-referential quantity, or self-citation chain by construction. The central claims are model-agnostic gains demonstrated via standard training and evaluation, with no load-bearing step that collapses to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that MLLM logits constitute a valid reward signal and that group filtering stabilizes training without side effects; no new physical entities are postulated and free parameters are limited to standard training hyperparameters.

free parameters (1)
  • group size and variance threshold for filtering
    Hyperparameters chosen to reduce MLLM scoring noise; their specific values are not stated in the abstract but affect optimization stability.
axioms (1)
  • domain assumption: MLLM output logits provide fine-grained, training-free feedback that correlates with editing quality across diverse instructions
    Invoked when the paper positions the MLLM as a unified reward model without additional training or calibration.
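
To show why the group size and variance threshold listed above matter, the sketch below computes per-group reward statistics and splits groups around a standard-deviation threshold. It assumes rewards are standardized within each group of samples for the same prompt, as in GRPO-style methods; that normalization, the function names, and the example numbers are assumptions made for this illustration. Which side of the threshold is kept for training is left to the caller, since the summary here does not pin that down.

```python
import statistics
from typing import List, Sequence, Tuple


def normalized_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def split_by_std(groups: Sequence[Sequence[float]],
                 std_threshold: float) -> Tuple[List[int], List[int]]:
    """Return (low_std_indices, high_std_indices) for a batch of reward groups."""
    low, high = [], []
    for index, rewards in enumerate(groups):
        if statistics.pstdev(rewards) < std_threshold:
            low.append(index)
        else:
            high.append(index)
    return low, high


# Hypothetical rewards: group 0 is nearly uniform, group 1 is clearly spread out.
groups = [[0.90, 0.91, 0.92, 0.89], [0.2, 0.9, 0.5, 0.7]]
low_std, high_std = split_by_std(groups, std_threshold=0.05)
tight_group_advantages = normalized_advantages(groups[0])
```

In the nearly uniform group, raw rewards differ by at most 0.03, yet the standardized advantages exceed 1 in magnitude. Small scoring noise in such groups is therefore scaled up to the same size as genuine quality differences elsewhere, which is the sensitivity the threshold has to manage.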

pith-pipeline@v0.9.0 · 5840 in / 1409 out tokens · 76594 ms · 2026-05-21T17:55:13.116001+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process... leveraging its output logits to provide fine-grained feedback... low-variance group filtering mechanism

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...

  2. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  3. Inline Critic Steers Image Editing

    cs.CV 2026-05 conditional novelty 7.0

    Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

  4. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  5. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  6. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  7. UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

  8. DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

    cs.CV 2026-02 unverdicted novelty 7.0

    DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in ...

  9. Setting the Stage: Text-Driven Scene-Consistent Image Generation

    cs.CV 2025-12 conditional novelty 7.0

    A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.

  10. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  11. Semantic Generative Tuning for Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

  12. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  13. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  14. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  15. SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

  16. Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

    cs.CV 2025-12 conditional novelty 6.0

    Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in multi-subject image generation, outperforming prior open-source models on new benchmarks.

  17. Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing

    cs.CV 2026-05 unverdicted novelty 5.0

    Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.

  18. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  19. Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

    cs.LG 2026-05 unverdicted novelty 5.0

    Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

  20. SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

    cs.CV 2026-04 unverdicted novelty 5.0

    SmartPhotoCrafter performs automatic photographic image editing by coupling an Image Critic module that identifies deficiencies with a Photographic Artist module that generates edits, trained via multi-stage pretraini...

  21. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  22. JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

  23. JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
