DRM: Diffusion-based Reward Model With Step-wise Guidance
Pith reviewed 2026-06-29 22:10 UTC · model grok-4.3
The pith
A pre-trained diffusion model can serve as a reward model that scores both final images and noisy intermediate latents to align generation with human preferences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Diffusion-based Reward Model (DRM) turns a pre-trained diffusion model into an evaluative backbone whose denoising trajectory supplies reward signals on both clean and noisy latents, enabling Step-wise GRPO to assign credit at every timestep and Step-wise Sampling to prune inferior trajectories on the fly.
What carries the argument
The Diffusion-based Reward Model (DRM), which re-uses the pre-trained diffusion model's noise-prediction network to score perceptual quality at arbitrary points along the generative trajectory.
If this is right
- Step-wise GRPO replaces sparse terminal rewards with per-denoising-step signals, reducing credit-assignment error in reinforcement-learning alignment.
- Step-wise Sampling lets the DRM choose among multiple latent paths at each timestep, steering the trajectory toward higher-quality endpoints.
- The method eliminates the need for a separate VLM reward model trained on semantic tasks.
- Final generated images exhibit improved perceptual quality across standard human-preference benchmarks.
Where Pith is reading between the lines
- The same diffusion backbone could supply reward signals for other iterative generative processes such as video or 3-D synthesis.
- Training separate reward models might become unnecessary if the generator itself already encodes the relevant perceptual criteria.
- The approach invites direct comparison between diffusion-based and flow-matching-based reward signals on identical tasks.
Load-bearing premise
A model that can generate high-fidelity images must already possess a deep internal understanding of aesthetics, composition, and visual harmony.
What would settle it
A controlled test in which the same diffusion model, when used as DRM, assigns higher scores to images that human raters judge lower in aesthetics or composition than to images the raters judge higher.
Figures
read the original abstract
Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual qualities-such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that use the pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Diffusion-based Reward Model (DRM), which repurposes a pre-trained diffusion model as a reward backbone capable of evaluating both final images and intermediate noisy latents. This step-wise capacity is used to define Step-wise GRPO (a dense-reward variant of GRPO) for RL-based alignment and Step-wise Sampling (a multi-path guidance strategy at inference). The central motivation is that diffusion models' high-fidelity generation implies they already encode the perceptual attributes (aesthetics, composition, visual harmony) needed for reward modeling, offering an alternative to VLM-based rewards.
Significance. If the core assumption holds and the diffusion backbone's representations can be shown to correlate with human perceptual judgments independently of the generation task, DRM would supply a new, architecture-native reward signal for diffusion alignment. The step-wise formulation could address credit assignment in RL and enable dynamic guidance during sampling, potentially improving stability and final image quality over current VLM-based pipelines.
major comments (2)
- [Abstract, paragraph 2] Abstract, paragraph 2: The load-bearing claim that 'a model capable of high-fidelity generation must possess a profound understanding of these visual attributes' is asserted without any supporting analysis, correlation study, or ablation. No evidence is given that the denoising objective produces features that reliably distinguish or score aesthetics/composition/visual harmony beyond what is already required for generation. This directly underpins both the superiority claim versus VLM rewards and the utility of Step-wise GRPO and Step-wise Sampling.
- [Abstract] Abstract: The manuscript states that 'Extensive experiments confirm that our approach significantly enhances the final quality of generated images,' yet provides no quantitative results, baselines, ablations, or human-study details. Without these, it is impossible to assess whether the reported gains are attributable to the DRM formulation or to other factors.
minor comments (1)
- [Abstract] Abstract: grammatical error ('that use the pre-trained' should be 'that uses the pre-trained').
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each point below and will revise the abstract accordingly to better support our claims with evidence from the full manuscript.
read point-by-point responses
-
Referee: [Abstract, paragraph 2] Abstract, paragraph 2: The load-bearing claim that 'a model capable of high-fidelity generation must possess a profound understanding of these visual attributes' is asserted without any supporting analysis, correlation study, or ablation. No evidence is given that the denoising objective produces features that reliably distinguish or score aesthetics/composition/visual harmony beyond what is already required for generation. This directly underpins both the superiority claim versus VLM rewards and the utility of Step-wise GRPO and Step-wise Sampling.
Authors: We agree the abstract presents the claim without immediate supporting data. The full manuscript includes ablations, correlation analyses with perceptual metrics, and human preference studies showing that diffusion features from the denoising process align with aesthetics and composition judgments beyond generation requirements alone. We will revise the abstract to include a brief reference to these results. revision: yes
-
Referee: [Abstract] Abstract: The manuscript states that 'Extensive experiments confirm that our approach significantly enhances the final quality of generated images,' yet provides no quantitative results, baselines, ablations, or human-study details. Without these, it is impossible to assess whether the reported gains are attributable to the DRM formulation or to other factors.
Authors: We acknowledge that the abstract summarizes the outcome without specific metrics or protocol details. The full manuscript reports quantitative gains against baselines, ablations isolating the DRM contribution, and human studies. We will revise the abstract to incorporate key quantitative results and evaluation details. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper presents DRM as a new paradigm grounded in an explicit argumentative insight (high-fidelity generation implies perceptual understanding), followed by two algorithmic proposals (Step-wise GRPO and Step-wise Sampling). No equations, fitted parameters, or predictions are shown to reduce by construction to inputs. No self-citations appear as load-bearing justifications for uniqueness or ansatzes. The derivation chain is self-contained and does not match any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A model capable of high-fidelity generation must possess a profound understanding of visual attributes such as aesthetics, composition, and visual harmony.
Reference graph
Works this paper leans on
-
[1]
com / black - forest - labs/flux/
Flux.https : / / github . com / black - forest - labs/flux/. 2
-
[2]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforce- ment learning.arXiv preprint arXiv:2305.13301, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Rank analysis of incomplete block designs: I
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired compar- isons.Biometrika, 39(3/4):324–345, 1952. 3
1952
-
[6]
Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, et al. Artimuse: Fine-grained image aesthetics as- sessment with joint scoring and expert-level understanding. arXiv preprint arXiv:2507.14533, 2025. 1
-
[7]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learn- ing, 2024. 1, 2
2024
-
[8]
Videoscore: Building automatic met- rics to simulate fine-grained human feedback for video gen- eration
Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic met- rics to simulate fine-grained human feedback for video gen- eration. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123,
2024
-
[9]
TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017. 2
2017
-
[11]
Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 2
2020
-
[12]
T2i-compbench: A comprehensive bench- mark for open-world compositional text-to-image genera- tion.Advances in Neural Information Processing Systems, 36:78723–78747, 2023
Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive bench- mark for open-world compositional text-to-image genera- tion.Advances in Neural Information Processing Systems, 36:78723–78747, 2023. 2
2023
-
[13]
Vbench: Comprehensive bench- mark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 2
2024
-
[14]
Pick-a-pic: An open dataset of user preferences for text-to-image generation.Ad- vances in neural information processing systems, 36:36652– 36663, 2023
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Ad- vances in neural information processing systems, 36:36652– 36663, 2023. 6
2023
-
[15]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text- to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow- based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, and Jian Zhang. Q-insight: Understanding im- age quality via visual reinforcement learning.arXiv preprint arXiv:2503.22679, 2025. 1, 2
-
[18]
Branchgrpo: Stable and efficient grpo with structured branching in diffusion models
Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models. arXiv preprint arXiv:2509.06040, 2025. 2
-
[19]
Rich human feedback for text-to-image generation
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. Rich human feedback for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19401–19411, 2024. 1, 2
2024
-
[20]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1, 2
2023
-
[22]
Henglin Liu, Huijuan Huang, Jing Wang, Chang Liu, Xiu Li, and Xiangyang Ji. Diversegrpo: Mitigating mode collapse in image generation via diversity-aware grpo.arXiv preprint arXiv:2512.21514, 2025. 2
-
[23]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via on- line rl.arXiv preprint arXiv:2505.05470, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Improving Video Generation with Human Feedback
Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Videodpo: Omni- preference alignment for video diffusion generation
Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni- preference alignment for video diffusion generation. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025. 1
2025
-
[26]
Evalcrafter: Benchmarking and eval- uating large video generation models
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and eval- uating large video generation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024. 2
2024
-
[27]
Re- thinking cross-modal interaction in multimodal diffusion transformers
Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, and Kwan-Yee K Wong. Re- thinking cross-modal interaction in multimodal diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5934–5943, 2025. 2
2025
-
[28]
Hpsv3: Towards wide-spectrum human preference score
Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025. 6
2025
-
[29]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 2
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[30]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 6
2021
-
[32]
Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023. 2
2023
-
[33]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2
2022
-
[34]
Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022. 1, 2
2022
-
[35]
Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 6
2022
-
[36]
Diffusion model align- ment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model align- ment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024. 1, 2
2024
-
[37]
Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xin- tao Wang, et al. Grpo-guard: Mitigating implicit over- optimization in flow matching via regulated clipping.arXiv preprint arXiv:2510.22319, 2025. 2
-
[38]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feed- back for text-to-video model alignment.arXiv preprint arXiv:2412.04814, 2024. 1, 2
-
[40]
Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine- tuning.arXiv preprint arXiv:2505.03318, 2025. 1, 2
-
[41]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,
work page internal anchor Pith review Pith/arXiv arXiv
-
[42]
Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hong- sheng Li. Better aligning text-to-image models with human preference.arXiv preprint arXiv:2303.14420, 1(3), 2023. 6
-
[43]
Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 36:15903–15935, 2023. 1, 2, 6
2023
-
[44]
Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shu- run Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Using human feedback to fine-tune diffusion models without any reward model
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941– 8951, 2024. 2
2024
-
[47]
A simple yet effective multi-modal reward model.arXiv preprint arXiv:2501.12368, 1(2), 2025
Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. A simple yet effective multi-modal reward model.arXiv preprint arXiv:2501.12368, 1(2), 2025. 1
-
[48]
Alignedgen: Aligning style across gener- ated images.Advances in Neural Information Processing Systems, 38:98168–98192, 2026
Jiexuan Zhang, Yiheng Du, Qian Wang, Weiqi Li, Yu Gu, and Jian Zhang. Alignedgen: Aligning style across gener- ated images.Advances in Neural Information Processing Systems, 38:98168–98192, 2026. 2
2026
-
[49]
Learning multi- dimensional human preference for text-to-image generation
Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingt- ing Gao, Di Zhang, and Zhongyuan Wang. Learning multi- dimensional human preference for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024. 1, 2, 6
2024
-
[50]
Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, and Chun- hong Pan. Diffusion model as a noise-aware latent reward model for step-level preference optimization.arXiv preprint arXiv:2502.01051, 2025. 2
-
[51]
Xuanyu Zhang, Weiqi Li, Shijie Zhao, Junlin Li, Li Zhang, and Jian Zhang. Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforce- ment learning.arXiv preprint arXiv:2506.18564, 2025. 1, 2 DRM: Diffusion-based Reward Model With Step-wise Guidance Supplementary Material a photo of a black kite and a green bea...
-
[52]
This bench- mark contains 553 prompts designed to test compositional understanding, including object counting, spatial relation- ships, and attribute binding
Result of GenEval To quantitatively assess text-image alignment, we evalu- ate our method on the GenEval benchmark. This bench- mark contains 553 prompts designed to test compositional understanding, including object counting, spatial relation- ships, and attribute binding. We apply our Step-GRPO to the SD3.5-M model and compare it against several strong ...
-
[53]
More Visualization Result The prompts in Figure S2 are as follows:
-
[54]
16-year-old teenager wearing a white bear-ear hat with a smirk on their face
-
[55]
photo of well done salmon dinner, 8K, Global Il- lumination, Ray Tracing Reflections
-
[56]
A lemon with a McDonald’s hat
-
[57]
The image is a mixed media collage with broken glass and torn paper elements, featuring intricate oil details and a canvas texture, in a contemporary art style
-
[58]
Kiwi fruit, mint leaves, ice cubes, background yellow, splashing water, soft box, back light, cre- ative food photography, Art by Alberto Seveso,
-
[59]
little tiny cub beautiful light color White fox soft fur kawaii chibi Walt Disney style, beautiful smiley face and beautiful eyes sweet and smiling features, snuggled in its soft and soft pastel pink cover, mag- ical light background, style Thomas kinkade Nadja Baxter Anne Stokes Nancy Noel realistic
-
[60]
185764, ink art, Calligraphy, bamboo plant :: or- ange, teal, white, black –ar 2:3 –uplight
-
[61]
The sunglasses have a deep black frame with bright pink lenses
A 3D Rendering of a cockatoo wearing sun- glasses. The sunglasses have a deep black frame with bright pink lenses. Fashion photography, volu- metric lighting, CG rendering
-
[62]
A rock formation in the shape of a horse, in- sanely detailed
-
[63]
a desert in a snowglobe, 4k, octane render :: cinematic –ar 2048:858
2048
-
[64]
watercolour beaver with tale, white background Model Overall↑ Single Obj.↑Two Obj.↑Counting↑Colors↑Position↑Attr. Binding↑ Flow Matching Models FLUX.1 Dev 0.66 0.98 0.81 0.74 0.79 0.22 0.45 SD3.5-L 0.71 0.98 0.89 0.73 0.83 0.34 0.47 SD3.5-M 0.63 0.98 0.78 0.50 0.81 0.24 0.52 GRPO based Methods SD3.5-M+Step-GRPO 0.78 0.99 0.93 0.80 0.86 0.37 0.70 Table S1....
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.