{"total":15,"items":[{"citing_arxiv_id":"2605.25661","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DRM: Diffusion-based Reward Model With Step-wise Guidance","primary_cat":"cs.CV","submitted_at":"2026-05-25T10:11:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23522","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models","primary_cat":"cs.LG","submitted_at":"2026-05-22T11:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15803","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embedding-perturbed Exploration Preference Optimization for Flow Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:56:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14274","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:18:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12112","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy","primary_cat":"cs.CV","submitted_at":"2026-05-12T13:29:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RLHF aligns generative models with human preferences via reward signals [39]. Building on early progress for diffusion models [6, 19, 59, 75], recent work extends RLHF to flow matching by enabling stochastic rollouts and GRPO-style optimization [40, 73]. Follow-up studies improve efficiency [21, 26, 35], preference modeling [78], theory [60], and reward hacking [61]. 
Despite these advances, RLHF for flow models often suffers from diversity collapse, and its underlying mechanism remains unclear. Entropy in LLM Reinforcement Finetuning. Entropy is widely used to characterize exploration and predict downstream gains in RLVR [12, 25, 52]. Token-level analyses further show that high-entropy tokens often correspond to key forking points and drive a disproportionate share of learning [5, 62]."},{"citing_arxiv_id":"2605.11480","ref_index":33,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Adjoint Matching for Fine-tuning Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T03:55:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding prior performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In NeurIPS, 2023. [31] J. Shin, J. Sul, J. Lee, J. Choi, and J. Choi. Efficient generative modeling beyond memoryless diffusion via adjoint schrödinger bridge matching. In ICML, 2026. [32] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021. [33] J. Wang, J. Liang, J. Liu, H. Liu, G. Liu, J. Zheng, W. Pang, A. Ma, Z. Xie, X. Wang, M. Wang, P. Wan, and X. Liang. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv:2510.22319, 2025. [34] Y. Wang, Z. Li, Y. Zang, Y. Zhou, J. Bu, C. Wang, Q. Lu, C. Jin, and J. Wang. 
Pref-GRPO: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning."},{"citing_arxiv_id":"2605.10937","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Despite their effectiveness, such approaches can be compromised by illusory advantage signals, leading to reward hacking. Hong et al. [21] demonstrates that combining multiple reward functions alleviates the problem only to a limited extent, motivating the pursuit of more fundamental improvements. By integrating ratio normalization and gradient reweighting, GRPO-Guard [22] alleviates this issue by moderating the clipping mechanism. 
Pref-GRPO [9] identifies that reward hacking occurs when minimal reward differences between images are exaggerated following normalization, and mitigates this problem through a pairwise preference-based GRPO approach that reformulates the optimization objective from score maximization to preference fitting."},{"citing_arxiv_id":"2605.10730","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen-Image-2.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:34:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10983","ref_index":19,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-09T04:41:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23380","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You 
Think","primary_cat":"cs.LG","submitted_at":"2026-04-25T17:03:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[38] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. CVPR, 2024. 2, 5 [39] Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv:2509.05952, 2025. 2 [40] Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv:2510.22319, 2025. 2 [41] Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. 
Unified reward model for multimodal understanding"},{"citing_arxiv_id":"2604.19009","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-21T02:57:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Building on this, FlowGRPO [27] and DanceGRPO [56] extend GRPO-style updates to flow-matching models by converting deterministic ordinary differential equation (ODE) sampling into stochastic differential equation (SDE) formulations. This modification introduces exploratory noise that facilitates group-wise policy optimization. Subsequent works [11,12,21,22,51] have further refined GRPO-based frameworks to enhance both training efficiency and stability. Despite these advances, recent studies [23,55,59] have identified limitations inherent in policy optimization methods that rely on likelihood estimation, such as systematic bias and restricted solver flexibility. 
In response, DiffusionNFT [59] proposes a novel approach that integrates reinforcement signals directly"},{"citing_arxiv_id":"2604.17415","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-04-19T12:47:52+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04142","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models","primary_cat":"cs.CV","submitted_at":"2026-04-05T15:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tion of flow dynamics, MixGRPO [15] further generalizes sampling by adopting mixed ODE-SDE trajectory schedules to trade off transport fidelity and stochastic diversity during policy optimization. Subsequent work targets robustness of the policy update: BranchGRPO [17] structures trajectory branching to separate multimodal denoising behavior, while GRPO-Guard [38] analyzes timestep-wise shifts in importance-ratio statistics and introduces ratio-normalization and gradient-reweighting corrections that restore effective clipping and prevent implicit over-optimization. 
Methods such as Fine-Grained GRPO [48] and Neighbor GRPO [8] refine credit assignment, via attribute-level decomposition or contrastive neighborhood objectives, to increase alignment granularity and pre-"},{"citing_arxiv_id":"2602.11146","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling","primary_cat":"cs.CV","submitted_at":"2026-02-11T18:57:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.04663","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design","primary_cat":"cs.LG","submitted_at":"2026-02-04T15:36:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}