pith. machine review for the scientific record.

arxiv: 2605.12112 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords RLHF · flow matching · perceptual entropy · diversity collapse · text-to-image · policy gradient · entropy regularization · generative models

The pith

In flow-based RLHF for text-to-image models, policy entropy remains constant while perceptual diversity collapses, requiring perceptual entropy constraints to restore balance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that RLHF fine-tuning of flow-matching text-to-image models causes diversity to collapse in the space of human perception. Standard policy entropy fails to detect or prevent this because it stays fixed, a consequence of the unchanging noise schedule in flow models. The collapse stems instead from policy gradients that concentrate probability on high-reward modes. Perceptual entropy is defined to measure spread in perceptual feature spaces, and regularization terms built on it keep the generator from collapsing onto a few high-reward modes. Experiments on multiple models and rewards confirm that these perceptual constraints improve the trade-off between image quality and variety.

Core claim

The central discovery is that policy entropy in flow models does not change during RLHF because of the fixed noise schedule, even as the generated images lose perceptual diversity through mode-seeking optimization. Perceptual entropy is proposed as an alternative that tracks diversity in perceptual spaces while satisfying entropy axioms. This leads to two methods, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, which are shown to enhance both quality and diversity metrics in practice.

What carries the argument

Perceptual entropy, computed as the entropy of the policy distribution projected or evaluated in a perceptual embedding space.
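The summary does not specify how this quantity is estimated in practice. A minimal sketch of one standard nonparametric route, a Kozachenko-Leonenko k-nearest-neighbor entropy estimate over features from a fixed perceptual encoder; `embed` below is a hypothetical stand-in for CLIP/DINO/VGG features, not the paper's pipeline:

```python
import numpy as np
from scipy.special import digamma, gammaln

def knn_entropy(z: np.ndarray, k: int = 3) -> float:
    """Kozachenko-Leonenko estimate of differential entropy (nats).

    z : (n, d) array of perceptual features for n generated images.
    Uses each point's distance to its k-th nearest neighbor.
    """
    n, d = z.shape
    # Pairwise Euclidean distances; inf on the diagonal so a point
    # never counts as its own neighbor.
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    eps = np.sort(dists, axis=1)[:, k - 1]            # k-th NN distance
    # log-volume of the d-dimensional unit ball
    log_unit_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return float(digamma(n) - digamma(k) + log_unit_ball
                 + d * np.mean(np.log(eps)))

# Hypothetical usage: project samples into the perceptual space first.
# feats = embed(images)        # fixed encoder, e.g. CLIP features, (n, d)
# h_perceptual = knn_entropy(feats, k=3)
```

Any consistent density or entropy estimator in the embedding space would play the same role; the point is that the quantity is computed on f(x), not on the sampler's noise.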

If this is right

  • Perceptual Entropy Constraint yields the highest quality-diversity score of 0.734 compared to the baseline of 0.366.
  • A complementary setting of PEC achieves an average diversity of 0.989 against the baseline's 0.047.
  • The gains hold consistently for two different base models, both neural and rule-based reward models, and three perceptual spaces.
  • Standard policy entropy regularization cannot prevent the model from converging to narrow high-reward regions in perceptual space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar constant-entropy behavior may limit standard regularization in other flow or diffusion generative models during preference alignment.
  • Extending perceptual spaces to include semantic or stylistic features could further tailor diversity preservation to specific applications.

Load-bearing premise

The chosen perceptual spaces will continue to reflect meaningful diversity that aligns with human preferences across different tasks and models without causing unintended reductions in other qualities.

What would settle it

Training a model with the proposed perceptual constraints and observing no increase in diversity under independent perceptual metrics or human studies would show that the constraints fail to deliver the intended preservation.

Figures

Figures reproduced from arXiv: 2605.12112 by Bin-Bin Gao, Chengjie Wang, Feng Zheng, Hongsong Wang, Jun Liu, Xiaofeng Tan, Xi Jiang, Yuanting Fan.

Figure 1: Diversity Collapse. Flow-GRPO improves quality but sacrifices diversity, producing similar styles. However, revisiting this intuition under flow matching reveals a key inconsistency: diversity collapses while the policy entropy surrogate remains constant.
Figure 2: Training-time Metrics. (a) Entropy vs. variance in VAE space. (b) Variance in pixel and reward space. Empirically verifies the constant-policy-entropy property beyond the theoretical analysis.
Figure 3: Optimization Behavior. GRPO shows mode-seeking behavior and converges to a single-peak distribution with high reward, while an ideal model should encourage coverage of multiple high-reward regions. (Panels plot Std(R({x_k})) and Var({x_k}) on a 10^-2 scale, and the max and min of {p_A(x_k)}_{k=1..K}, against training step.)
Figure 5: Relationship between Reward and Perceptual Entropy.
Figure 6: Optimization Comparison. (a) Native KL (red) versus KL-as-reward (blue), showing reward and entropy dynamics; (b) entropy with (blue) and without (red) JPCVAE. As illustrated, our method effectively mitigates these optimization issues. Compared to Native KL, which introduces competing gradients, perceptual reward shaping aligns perceptual constraints with the primary objective and stabilizes training.
Figure 7: Training Dynamics of Flow-GRPO, Flow-GRPO with KL, and PEC with Perceptual Entropy. (a) Mean reward. (b) Reward standard deviation.
Original abstract

RLHF is widely used to align flow-matching text-to-image models with human preferences, but often leads to severe diversity collapse after fine-tuning. In RL, diversity is often assumed to correlate with policy entropy, motivating entropy regularization. However, we show this intuition breaks in flow models: policy entropy remains constant, even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To this end, we introduce perceptual entropy that captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight, we propose two entropy-regularized strategies, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, to preserve perceptual diversity and improve the quality. Experiments across two base models, neural and rule-based rewards, and three perceptual spaces demonstrate consistent gains in the quality-diversity trade-off; PEC achieves the best overall score of 0.734 (vs. baseline's 0.366); a complementary setting of PEC further reaches a diversity average of 0.989 (vs. baseline's 0.047). Our project page (https://xiaofeng-tan.github.io/projects/PEC) is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that standard policy entropy regularization fails to preserve diversity in flow-matching text-to-image RLHF because policy entropy remains constant (due to the fixed pre-defined noise schedule) while perceptual diversity collapses (due to mode-seeking policy gradients). It introduces perceptual entropy as a diversity measure in perceptual spaces that preserves standard entropy properties, and proposes two regularization strategies, Perceptual Entropy Constraint (PEC) and Perceptual Constraints on Generation Space, to improve the quality-diversity trade-off. Experiments across two base models, neural/rule-based rewards, and three perceptual spaces report gains such as an overall score of 0.734 vs. baseline 0.366 and a diversity average of 0.989 vs. 0.047.

Significance. If the claimed mismatch between constant policy entropy and collapsing perceptual diversity is rigorously established and the perceptual entropy constraints generalize without side effects, the work would be significant for RLHF in flow-based generative models, offering a targeted fix for diversity collapse that standard entropy regularization cannot address. The multi-setting empirical evaluation and public project page provide some support for practical utility and reproducibility.

major comments (2)
  1. [Abstract / theoretical analysis] Abstract and theoretical explanation: The central claim that policy entropy remains constant due to the fixed noise schedule does not engage with the change-of-variables formula for differential entropy under a learned flow. In flow-matching models the output entropy equals the base entropy plus the integral of the divergence of the time-dependent vector field along trajectories; RLHF updates to the vector field should therefore alter output entropy unless the divergence is provably invariant, which is not shown. (The identity in question is sketched below.)
  2. [Method] Method section on perceptual entropy: The definition of perceptual entropy is stated to 'capture diversity in a perceptual space and maintain the property of standard entropy,' yet no explicit functional form, proof of non-negativity or concavity, or comparison to the standard differential entropy appears. Without these, it is unclear whether the quantity is independent of the reward model or reduces to a fitted heuristic, weakening the justification for the proposed constraints.
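For reference, the identity invoked in major comment 1, written in the standard continuous-normalizing-flow form (the notation is assumed here, not taken from the paper): for a learned flow transporting a base density p_0 to the output density p_1,

```latex
% Instantaneous change of variables along dx_t/dt = v_\theta(x_t, t):
%   \frac{d}{dt}\log p_t(x_t) = -\,\nabla\cdot v_\theta(x_t, t).
% Integrating over t and taking expectations yields
\[
  H(p_1) \;=\; H(p_0) \;+\; \mathbb{E}_{x_0 \sim p_0}\!\left[
    \int_0^1 \nabla\cdot v_\theta(x_t, t)\,\mathrm{d}t
  \right],
\]
% so updates to v_\theta can move the output entropy H(p_1) through the
% divergence term even though the base entropy H(p_0) is fixed.
```

Under this accounting, a truly constant "policy entropy" can only be the base entropy H(p_0) or a schedule-determined surrogate; that is the distinction the authors draw in the rebuttal below.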
minor comments (2)
  1. [Experiments] Experiments: Reported aggregate scores (0.734 vs. 0.366; 0.989 vs. 0.047) are given without error bars, number of random seeds, or statistical tests, making it impossible to judge whether the quality-diversity improvements are reliable. (A seed-level bootstrap sketch follows these comments.)
  2. [Abstract] Abstract: The two proposed strategies are named but not briefly characterized, which would improve immediate readability for readers scanning the contribution.
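On minor comment 1, a minimal sketch of the seed-level percentile bootstrap such a revision could report; the score arrays below are hypothetical placeholders, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean over seeds."""
    scores = np.asarray(scores, dtype=float)
    # Resample seeds with replacement and recompute the mean each time.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Placeholder per-seed quality-diversity scores (illustrative only).
pec_scores      = [0.74, 0.72, 0.73, 0.75, 0.73]
baseline_scores = [0.36, 0.38, 0.35, 0.37, 0.37]
print("PEC      95% CI:", bootstrap_ci(pec_scores))
print("baseline 95% CI:", bootstrap_ci(baseline_scores))
```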

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications on our theoretical analysis and definitions. Where the comments identify gaps in rigor or explicitness, we indicate that revisions will be made to the manuscript.

point-by-point responses
  1. Referee: [Abstract / theoretical analysis] Abstract and theoretical explanation: The central claim that policy entropy remains constant due to the fixed noise schedule does not engage with the change-of-variables formula for differential entropy under a learned flow. In flow-matching models the output entropy equals base entropy plus the integral of the divergence of the time-dependent vector field along trajectories; RLHF updates to the vector field should therefore alter output entropy unless the divergence is provably invariant, which is not shown.

    Authors: We appreciate the referee's observation regarding the change-of-variables formula. In our analysis, 'policy entropy' specifically denotes the entropy of the fixed base noise distribution at the input to the flow, which is invariant because the noise schedule is pre-defined and unchanged by RLHF updates to the vector field. The perceptual diversity collapse, by contrast, arises from the mode-seeking behavior of the policy gradient in the output space. We acknowledge that a full accounting of output differential entropy must incorporate the divergence integral. In the revised manuscript we will expand the theoretical section to explicitly reference the change-of-variables formula, derive the relationship between input entropy and perceptual-space diversity under our RLHF objective, and discuss why the divergence term does not counteract the observed mode collapse in practice. This will strengthen the justification for why standard policy-entropy regularization is insufficient. (A toy illustration of this schedule-determined invariance follows these responses.) revision: yes

  2. Referee: [Method] Method section on perceptual entropy: The definition of perceptual entropy is stated to 'capture diversity in a perceptual space and maintain the property of standard entropy,' yet no explicit functional form, proof of non-negativity or concavity, or comparison to the standard differential entropy appears. Without these, it is unclear whether the quantity is independent of the reward model or reduces to a fitted heuristic, weakening the justification for the proposed constraints.

    Authors: We thank the referee for noting the need for greater formality. Perceptual entropy is defined as the differential entropy of the push-forward distribution of generated samples under a fixed perceptual embedding map (e.g., CLIP or VGG features). Its explicit form is H_perc = -E_{x ~ p_gen}[ log p_perc(f(x)) ], where f is the embedding function and p_perc is the density estimated in the perceptual space; because the embedding is fixed and independent of the reward model, the quantity inherits concavity and the other standard properties of differential entropy on the embedded manifold (as a differential entropy it is not sign-constrained, and we will state its properties precisely). In the revised method section we will state this functional form explicitly, include a short derivation confirming the inherited properties, and add a comparison paragraph showing that it reduces to ordinary differential entropy when the perceptual map is the identity. These additions will clarify that the measure is not a fitted heuristic and directly supports the proposed PEC and generation-space constraints. revision: yes
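To make response 1 concrete: in flow-GRPO-style SDE sampling each denoising transition is Gaussian with a standard deviation taken from the fixed noise schedule, so the usual policy-entropy surrogate is a closed-form function of the schedule alone. A toy sketch; the linear schedule and dimensions are illustrative, not the paper's:

```python
import numpy as np

def policy_entropy_surrogate(sigmas: np.ndarray, d: int) -> float:
    """Sum of per-step Gaussian entropies along the sampling chain.

    Each transition p(x_{t+1} | x_t) = N(mu_theta(x_t, t), sigma_t^2 I)
    has entropy (d/2) * log(2 * pi * e * sigma_t^2), which depends only
    on the schedule sigma_t and the dimension d, not on the learned
    mean, hence not on any RLHF update to theta.
    """
    return float(np.sum(0.5 * d * np.log(2 * np.pi * np.e * sigmas ** 2)))

sigmas = np.linspace(0.8, 0.05, 40)   # a fixed, pre-defined noise schedule
print(policy_entropy_surrogate(sigmas, d=4 * 64 * 64))
# Identical before and after any parameter update: this surrogate cannot
# register the perceptual collapse that the paper measures.
```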

Circularity Check

0 steps flagged

No significant circularity; the central claims rest on model structure and a new perceptual measure rather than on self-referential fits or citations.

full rationale

The paper's derivation of constant policy entropy from the fixed noise schedule in flow-matching models is presented as a direct consequence of the generative process definition (base distribution plus learned transport), not as a renaming or a fit of the target diversity quantity. Perceptual entropy is introduced as an explicit new functional on chosen perceptual embeddings that is stated to preserve entropy-like properties; this is a definitional extension rather than a construction that reduces the output to the input. No load-bearing self-citation chains, uniqueness theorems from the same authors, or fitted parameters relabeled as predictions appear in the abstract or the described chain. Experiments across models and spaces supply independent empirical content. The derivation chain is therefore checkable against external benchmarks rather than self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about flow-model dynamics and introduces perceptual entropy as a new measure; the abstract gives insufficient detail to list specific fitted parameters.

axioms (2)
  • domain assumption · Policy entropy in flow models remains constant due to the fixed pre-defined noise schedule
    Invoked to explain why standard entropy regularization fails to prevent diversity collapse.
  • domain assumption · Diversity collapse is driven by the mode-seeking nature of policy gradients
    Used to account for the observed perceptual narrowing despite constant entropy.
invented entities (1)
  • perceptual entropy · no independent evidence
    purpose: Captures diversity in a perceptual space while maintaining the mathematical properties of standard entropy
    New quantity introduced to replace or augment policy entropy for regularization in flow-based RLHF.

pith-pipeline@v0.9.0 · 5579 in / 1551 out tokens · 143469 ms · 2026-05-13T05:45:26.678221+00:00 · methodology

discussion (0)

