
arxiv: 2605.25661 · v2 · pith:WDCU245Anew · submitted 2026-05-25 · 💻 cs.CV

DRM: Diffusion-based Reward Model With Step-wise Guidance

Pith reviewed 2026-06-29 22:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords: diffusion model, reward model, preference alignment, image generation, reinforcement learning, step-wise guidance, GRPO

The pith

A pre-trained diffusion model can serve as a reward model that scores both final images and noisy intermediate latents to align generation with human preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that because diffusion models already generate high-fidelity images, they must encode a deeper grasp of aesthetics, composition, and harmony than vision-language models trained only for semantic matching. It therefore treats the diffusion model itself as the reward backbone, allowing evaluation at every denoising step rather than only at the end. This step-wise capacity is used first to supply dense per-step rewards inside a modified GRPO reinforcement-learning procedure and second to score multiple candidate paths at each inference step. Experiments show the resulting images have measurably higher perceptual quality than those aligned with conventional VLM reward models.

Core claim

The Diffusion-based Reward Model (DRM) turns a pre-trained diffusion model into an evaluative backbone whose denoising trajectory supplies reward signals on both clean and noisy latents, enabling Step-wise GRPO to assign credit at every timestep and Step-wise Sampling to prune inferior trajectories on the fly.

What carries the argument

The Diffusion-based Reward Model (DRM), which re-uses the pre-trained diffusion model's noise-prediction network to score perceptual quality at arbitrary points along the generative trajectory.
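One way such a scorer could be trained on a preference pair is sketched below, following the Figure 3 caption: both images are noised at the same timestep t, each is scored, and the preferred one is pushed above the dispreferred one. The Bradley-Terry form of the loss, the linear-interpolation noising, and the names `add_noise`, `toy_reward`, and `pairwise_loss` are assumptions made for illustration; the page does not spell out the paper's actual DRM loss or reward head.

```python
import math
import random


def add_noise(latent, t, rng):
    """Mix a clean latent with Gaussian noise; t in [0, 1], where 0 means clean.

    Assumption: a rectified-flow style interpolation x_t = (1 - t) * x_0 + t * eps.
    """
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]


def toy_reward(latent, t, weights):
    """Hypothetical stand-in for 'diffusion backbone + reward head': a linear score."""
    return sum(w * x for w, x in zip(weights, latent))


def pairwise_loss(r_preferred, r_dispreferred):
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_preferred - r_dispreferred))."""
    margin = r_preferred - r_dispreferred
    # softplus(-margin), arranged to stay stable for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


rng = random.Random(0)
weights = [0.5, -0.2, 0.1]            # toy parameters, not learned
preferred = [1.0, 0.0, 2.0]           # toy "latents"
dispreferred = [-1.0, 1.0, 0.0]
t = 0.3                               # both images are noised at the same timestep
loss = pairwise_loss(
    toy_reward(add_noise(preferred, t, rng), t, weights),
    toy_reward(add_noise(dispreferred, t, rng), t, weights),
)
```

Sampling t across the whole noise range during training is what would let a scorer of this kind be queried at any point of the trajectory.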

If this is right

  • Step-wise GRPO replaces sparse terminal rewards with per-denoising-step signals, reducing credit-assignment error in reinforcement-learning alignment.
  • Step-wise Sampling lets the DRM choose among multiple latent paths at each timestep, steering the trajectory toward higher-quality endpoints.
  • The method eliminates the need for a separate VLM reward model trained on semantic tasks.
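  • Final generated images score higher on human-preference metrics: in the scores extracted with Figure 6, SD3.5-Medium trained with DRM and Step-GRPO has the highest ImageReward, PickScore, and HPSv3 values among the compared settings.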
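The credit-assignment difference in the first bullet can be made concrete with a small sketch. It contrasts a terminal-reward scheme, where one group-normalized advantage is copied to every denoising step, with a step-wise scheme, where the candidates explored at each step are normalized against each other. The function names and the use of a population standard deviation are assumptions for illustration, not the paper's exact formulation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def terminal_advantages(final_rewards, num_steps):
    """Terminal-reward baseline: one advantage per trajectory, reused at every step."""
    per_trajectory = group_advantages(final_rewards)
    return [[a] * num_steps for a in per_trajectory]


def stepwise_advantages(step_rewards):
    """Step-wise variant: step_rewards[t][i] is the reward of candidate i at step t.

    Each step's candidates are normalized only against the other candidates
    explored at that same step, so every step receives its own signal.
    """
    return [group_advantages(rewards_t) for rewards_t in step_rewards]


# Two trajectories, three steps: each trajectory's advantage is copied to all steps.
coarse = terminal_advantages([1.0, 3.0], num_steps=3)

# Two steps with two candidates each: the second step carries no preference.
dense = stepwise_advantages([[0.0, 2.0], [5.0, 5.0]])
```

In the terminal scheme a step that contributed nothing still inherits the trajectory's advantage; in the step-wise scheme a step whose candidates score identically yields zero advantage.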

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diffusion backbone could supply reward signals for other iterative generative processes such as video or 3-D synthesis.
  • Training separate reward models might become unnecessary if the generator itself already encodes the relevant perceptual criteria.
  • The approach invites direct comparison between diffusion-based and flow-matching-based reward signals on identical tasks.

Load-bearing premise

A model that can generate high-fidelity images must already possess a deep internal understanding of aesthetics, composition, and visual harmony.

What would settle it

A controlled test in which the same diffusion model, used as the DRM, systematically scores images that human raters judge worse in aesthetics or composition above images the raters judge better.
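A minimal sketch of how such a test could be scored, assuming each image has one DRM score and one aggregated human rating. The function name `pairwise_agreement` and the tie-handling rules are illustrative choices, not something specified by the paper.

```python
def pairwise_agreement(model_scores, human_scores):
    """Fraction of image pairs that the model and the human raters order the same way.

    Pairs with tied human ratings are skipped; a tie in model scores on a pair
    that humans do distinguish counts as a disagreement.
    """
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    agree, total = 0, 0
    n = len(model_scores)
    for i in range(n):
        for j in range(i + 1, n):
            human_diff = human_scores[i] - human_scores[j]
            if human_diff == 0:
                continue
            model_diff = model_scores[i] - model_scores[j]
            total += 1
            if model_diff * human_diff > 0:
                agree += 1
    if total == 0:
        raise ValueError("need at least one pair with distinct human ratings")
    return agree / total
```

Agreement near 1.0 would support the premise, agreement near 0.5 would suggest the scores carry little perceptual signal, and agreement well below 0.5 would be the falsifying outcome described above.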

Figures

Figures reproduced from arXiv: 2605.25661 by Binxin Yang, Chen Li, Hubery Yin, Jaxon Zhang, Jing Lyu.

Figure 1: Comparison between previous reward models and DRM. Existing reward models treat the generation process as a black box, providing only a single, terminal reward based on the final output. Our DRM offers fine-grained reward for any noisy latent along the entire denoising trajectory.
Figure 2: Reward curves for various RL algorithms optimized using our DRM (x-axis: training step; curves: Step-GRPO and GRPO). Our Step-GRPO, which leverages dense, per-step rewards, not only reaches a higher final reward but also converges 2.5x faster in training steps than the standard GRPO baseline.
Figure 3: Overview of the Diffusion-based Reward Model (DRM). (Left) The training pipeline. During training, the DRM takes a pair of preferred and dispreferred images, both corrupted with noise at a specific timestep t, and predicts their respective reward scores. The model is then optimized via DRM loss. (Right) The detailed architecture of our Reward Output Head.
Figure 4: GRPO vs. Step-wise GRPO. (Left) Naive GRPO relies on a terminal reward. It samples multiple full trajectories, calculates a single reward at the final step (t=0), and applies this coarse reward uniformly to all preceding steps, leading to imprecise credit assignment. (Right) Step-wise GRPO introduces a dense, per-step reward signal. From a single initial point, it explores k candidate samples via SDE at ea…
Figure 5: Overview of Step-wise Sampling. At each step t, we perform a branching into k candidates via SDE. The DRM scores these candidates, and the top-scoring latent is chosen to continue the trajectory.
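The Step-wise Sampling loop described in the Figure 5 caption can be sketched as a greedy best-of-k selection at every step. The `propose` and `score` callables below are hypothetical stand-ins for one stochastic (SDE) denoising step and for the DRM. The toy one-dimensional functions only show the control flow and are not the authors' implementation.

```python
import random


def stepwise_sample(init_latent, num_steps, k, propose, score, rng):
    """Greedy step-wise selection: branch into k candidates, keep the best one.

    propose(latent, step, rng) -> one stochastic candidate for the next latent
    score(latent, step)        -> scalar reward for a (possibly noisy) latent
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    latent = init_latent
    for step in range(num_steps):
        candidates = [propose(latent, step, rng) for _ in range(k)]
        latent = max(candidates, key=lambda c: score(c, step))
    return latent


# Toy 1-D usage: the "denoiser" shrinks the latent and adds noise,
# and the "reward" prefers values close to 1.0.
def toy_propose(latent, step, rng):
    return 0.9 * latent + rng.gauss(0.0, 0.3)


def toy_score(latent, step):
    return -abs(latent - 1.0)


result = stepwise_sample(0.0, num_steps=10, k=4, propose=toy_propose,
                         score=toy_score, rng=random.Random(0))
```

With k = 1 this reduces to ordinary stochastic sampling. Larger k means k proposal steps and k scorer evaluations at every denoising step.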
Figure 6: Qualitative comparison of SD3.5-Medium optimized by various reward models. Our approach clearly exhibits superior visual quality compared to the competing methods.
Scores extracted alongside Figure 6:
Model                 ImageReward  PickScore  HPSv3
SD3.5-Medium          1.01         16.76      8.95
+ PickScore & GRPO    1.14         16.94      9.64
+ HPSv3 & GRPO        1.15         16.90      9.71
+ DRM & GRPO          1.14         16.95      10.07
+ DRM & Step-GRPO     1.17         17.04      10.28
Figure 7: Reward curves with steps and GPU hours as the x-axis.
Figure 8: Step-wise Sampling enhances both the fidelity to the …
Original abstract

Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual qualities-such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that use the pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Diffusion-based Reward Model (DRM), which repurposes a pre-trained diffusion model as a reward backbone capable of evaluating both final images and intermediate noisy latents. This step-wise capacity is used to define Step-wise GRPO (a dense-reward variant of GRPO) for RL-based alignment and Step-wise Sampling (a multi-path guidance strategy at inference). The central motivation is that diffusion models' high-fidelity generation implies they already encode the perceptual attributes (aesthetics, composition, visual harmony) needed for reward modeling, offering an alternative to VLM-based rewards.

Significance. If the core assumption holds and the diffusion backbone's representations can be shown to correlate with human perceptual judgments independently of the generation task, DRM would supply a new, architecture-native reward signal for diffusion alignment. The step-wise formulation could address credit assignment in RL and enable dynamic guidance during sampling, potentially improving stability and final image quality over current VLM-based pipelines.

major comments (2)
  1. [Abstract, paragraph 2] The load-bearing claim that 'a model capable of high-fidelity generation must possess a profound understanding of these visual attributes' is asserted without any supporting analysis, correlation study, or ablation. No evidence is given that the denoising objective produces features that reliably distinguish or score aesthetics/composition/visual harmony beyond what is already required for generation. This directly underpins both the superiority claim versus VLM rewards and the utility of Step-wise GRPO and Step-wise Sampling.
  2. [Abstract] The manuscript states that 'Extensive experiments confirm that our approach significantly enhances the final quality of generated images,' yet provides no quantitative results, baselines, ablations, or human-study details. Without these, it is impossible to assess whether the reported gains are attributable to the DRM formulation or to other factors.
minor comments (1)
  1. [Abstract] Grammatical error ('that use the pre-trained' should be 'that uses the pre-trained').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the abstract accordingly to better support our claims with evidence from the full manuscript.

Point-by-point responses
  1. Referee: [Abstract, paragraph 2] The load-bearing claim that 'a model capable of high-fidelity generation must possess a profound understanding of these visual attributes' is asserted without any supporting analysis, correlation study, or ablation. No evidence is given that the denoising objective produces features that reliably distinguish or score aesthetics/composition/visual harmony beyond what is already required for generation. This directly underpins both the superiority claim versus VLM rewards and the utility of Step-wise GRPO and Step-wise Sampling.

    Authors: We agree the abstract presents the claim without immediate supporting data. The full manuscript includes ablations, correlation analyses with perceptual metrics, and human preference studies showing that diffusion features from the denoising process align with aesthetics and composition judgments beyond generation requirements alone. We will revise the abstract to include a brief reference to these results. revision: yes

  2. Referee: [Abstract] The manuscript states that 'Extensive experiments confirm that our approach significantly enhances the final quality of generated images,' yet provides no quantitative results, baselines, ablations, or human-study details. Without these, it is impossible to assess whether the reported gains are attributable to the DRM formulation or to other factors.

    Authors: We acknowledge that the abstract summarizes the outcome without specific metrics or protocol details. The full manuscript reports quantitative gains against baselines, ablations isolating the DRM contribution, and human studies. We will revise the abstract to incorporate key quantitative results and evaluation details. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents DRM as a new paradigm grounded in an explicit argumentative insight (high-fidelity generation implies perceptual understanding), followed by two algorithmic proposals (Step-wise GRPO and Step-wise Sampling). No equations, fitted parameters, or predictions are shown to reduce by construction to inputs. No self-citations appear as load-bearing justifications for uniqueness or ansatzes. The derivation chain is self-contained and does not match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the domain assumption that diffusion models encode perceptual qualities.

axioms (1)
  • domain assumption A model capable of high-fidelity generation must possess a profound understanding of visual attributes such as aesthetics, composition, and visual harmony.
    Invoked in abstract paragraph 2 as the justification for using the diffusion model as reward backbone.

pith-pipeline@v0.9.1-grok · 5760 in / 1300 out tokens · 31842 ms · 2026-06-29T22:10:27.760705+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 26 canonical work pages · 15 internal anchors

  1. [1]

    Flux

    Flux. https://github.com/black-forest-labs/flux/. 2

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 1, 2

  4. [4]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023. 3

  5. [5]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. 3

  6. [6]

    Artimuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding

    Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, et al. Artimuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding. arXiv preprint arXiv:2507.14533, 2025. 1

  7. [7]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. 1, 2

  8. [8]

    Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

    Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123.

  9. [9]

    TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

    Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow-grpo: When timing matters for grpo in flow models. arXiv preprint arXiv:2508.04324, 2025. 2

  10. [10]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 2

  11. [11]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 1, 2

  12. [12]

    T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023. 2

  13. [13]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 2

  14. [14]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems, 36:36652–36663, 2023. 6

  15. [15]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023. 1

  16. [16]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025. 2

  17. [17]

    Q-insight: Understanding image quality via visual reinforcement learning

    Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, and Jian Zhang. Q-insight: Understanding image quality via visual reinforcement learning. arXiv preprint arXiv:2503.22679, 2025. 1, 2

  18. [18]

    Branchgrpo: Stable and efficient grpo with structured branching in diffusion models

    Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models. arXiv preprint arXiv:2509.06040, 2025. 2

  19. [19]

    Rich human feedback for text-to-image generation

    Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. Rich human feedback for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19401–19411, 2024. 1, 2

  20. [20]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 3

  21. [21]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1, 2

  22. [22]

    Diversegrpo: Mitigating mode collapse in image generation via diversity-aware grpo

    Henglin Liu, Huijuan Huang, Jing Wang, Chang Liu, Xiu Li, and Xiangyang Ji. Diversegrpo: Mitigating mode collapse in image generation via diversity-aware grpo. arXiv preprint arXiv:2512.21514, 2025. 2

  23. [23]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025. 2, 3

  24. [24]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025. 1, 2

  25. [25]

    Videodpo: Omni-preference alignment for video diffusion generation

    Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025. 1

  26. [26]

    Evalcrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024. 2

  27. [27]

    Rethinking cross-modal interaction in multimodal diffusion transformers

    Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, and Kwan-Yee K Wong. Rethinking cross-modal interaction in multimodal diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5934–5943, 2025. 2

  28. [28]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025. 6

  29. [29]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 2

  30. [30]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 1, 2

  31. [31]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2, 6

  32. [32]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023. 2

  33. [33]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2

  34. [34]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 1, 2

  35. [35]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022. 6

  36. [36]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024. 1, 2

  37. [37]

    Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping

    Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025. 2

  38. [38]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 2

  39. [39]

    Lift: Leveraging human feedback for text-to-video model alignment

    Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feedback for text-to-video model alignment. arXiv preprint arXiv:2412.04814, 2024. 1, 2

  40. [40]

    Unified multimodal chain-of-thought reward model through reinforcement fine-tuning

    Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025. 1, 2

  41. [41]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.

  42. [42]

    Better aligning text-to-image models with human preference

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 1(3), 2023. 6

  43. [43]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023. 1, 2, 6

  44. [44]

    VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024. 1, 2

  45. [45]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025. 2

  46. [46]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024. 2

  47. [47]

    A simple yet effective multi-modal reward model

    Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. A simple yet effective multi-modal reward model. arXiv preprint arXiv:2501.12368, 1(2), 2025. 1

  48. [48]

    Alignedgen: Aligning style across generated images

    Jiexuan Zhang, Yiheng Du, Qian Wang, Weiqi Li, Yu Gu, and Jian Zhang. Alignedgen: Aligning style across generated images. Advances in Neural Information Processing Systems, 38:98168–98192, 2026. 2

  49. [49]

    Learning multi-dimensional human preference for text-to-image generation

    Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024. 1, 2, 6

  50. [50]

    Diffusion model as a noise-aware latent reward model for step-level preference optimization

    Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, and Chunhong Pan. Diffusion model as a noise-aware latent reward model for step-level preference optimization. arXiv preprint arXiv:2502.01051, 2025. 2

  51. [51]

    Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforcement learning

    Xuanyu Zhang, Weiqi Li, Shijie Zhao, Junlin Li, Li Zhang, and Jian Zhang. Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforcement learning. arXiv preprint arXiv:2506.18564, 2025. 1, 2

  52. [52]

    Result of GenEval

    Result of GenEval. To quantitatively assess text-image alignment, we evaluate our method on the GenEval benchmark. This benchmark contains 553 prompts designed to test compositional understanding, including object counting, spatial relationships, and attribute binding. We apply our Step-GRPO to the SD3.5-M model and compare it against several strong ...

  53. [53]

    More Visualization Result The prompts in Figure S2 are as follows:

  54. [54]

    16-year-old teenager wearing a white bear-ear hat with a smirk on their face

  55. [55]

    photo of well done salmon dinner, 8K, Global Illumination, Ray Tracing Reflections

  56. [56]

    A lemon with a McDonald’s hat

  57. [57]

    The image is a mixed media collage with broken glass and torn paper elements, featuring intricate oil details and a canvas texture, in a contemporary art style

  58. [58]

    Kiwi fruit, mint leaves, ice cubes, background yellow, splashing water, soft box, back light, creative food photography, Art by Alberto Seveso,

  59. [59]

    little tiny cub beautiful light color White fox soft fur kawaii chibi Walt Disney style, beautiful smiley face and beautiful eyes sweet and smiling features, snuggled in its soft and soft pastel pink cover, magical light background, style Thomas kinkade Nadja Baxter Anne Stokes Nancy Noel realistic

  60. [60]

    185764, ink art, Calligraphy, bamboo plant :: orange, teal, white, black –ar 2:3 –uplight

  61. [61]

    A 3D Rendering of a cockatoo wearing sunglasses. The sunglasses have a deep black frame with bright pink lenses. Fashion photography, volumetric lighting, CG rendering

  62. [62]

    A rock formation in the shape of a horse, insanely detailed

  63. [63]

    a desert in a snowglobe, 4k, octane render :: cinematic –ar 2048:858

  64. [64]

    watercolour beaver with tale, white background

    Model                 Overall↑  Single Obj.↑  Two Obj.↑  Counting↑  Colors↑  Position↑  Attr. Binding↑
    Flow Matching Models
    FLUX.1 Dev            0.66      0.98          0.81       0.74       0.79     0.22       0.45
    SD3.5-L               0.71      0.98          0.89       0.73       0.83     0.34       0.47
    SD3.5-M               0.63      0.98          0.78       0.50       0.81     0.24       0.52
    GRPO based Methods
    SD3.5-M+Step-GRPO     0.78      0.99          0.93       0.80       0.86     0.37       0.70
    Table S1....