{"total":15,"items":[{"citing_arxiv_id":"2606.02218","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing","primary_cat":"cs.LG","submitted_at":"2026-06-01T13:20:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAGC dynamically adjusts group sizes in synchronous GRPO and DAPO via online constrained optimization to cut stragglers, improve wall-clock speed, and maintain or improve rewards and downstream reasoning performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28691","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-27T16:19:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OSP-Next reports 83.73% VBench score and up to 2.27x speedup via hybrid sparse attention, SSP parallelism, HiF8 quantization, and Mix-GRPO on diffusion transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27736","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Explicit Critic Guidance for Aligning Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-26T22:20:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference 
steering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25661","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DRM: Diffusion-based Reward Model With Step-wise Guidance","primary_cat":"cs.CV","submitted_at":"2026-05-25T10:11:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15803","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embedding-perturbed Exploration Preference Optimization for Flow Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T09:56:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15458","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video Models Can Reason with Verifiable Rewards","primary_cat":"cs.CV","submitted_at":"2026-05-14T22:40:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and 
Sokoban.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12112","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy","primary_cat":"cs.CV","submitted_at":"2026-05-12T13:29:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:36652-36663, 2023. [33] Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024. [34] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025. [35] Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025. [36] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Fang, Xiaobo Wang, Wenhao Wang, Zhenyu Yang, Jiawei Li, Xianfang Shi, Hao Zhang, et al. 
Hunyuan-dit: A powerful multi-resolution"},{"citing_arxiv_id":"2605.10937","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[103] Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, and Nong Sang. Densegrpo: From sparse to dense reward for flow matching model alignment.arXiv preprint arXiv:2601.20218, 2026. [104] Zheng Ding and Weirui Ye. Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models.arXiv preprint arXiv:2512.08153, 2025. [105] Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025. 
[106] Xiaolong Fu, Lichen Ma, Zipeng Guo, Gaojing Zhou, Chongxiao Wang, ShiPing Dong, Shizhe Zhou, Ximan Liu, Jingling Fu, Tan Lit Sin, et al."},{"citing_arxiv_id":"2604.25427","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Systematic Post-Train Framework for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-28T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"issues of reward sparsity and inaccuracy arising from assigning a single global reward to multi-step SDE trajectories. Along the line of addressing sparse/ambiguous supervision over multi-step trajectories, E-GRPO [27] identifies that only high-entropy steps contribute to effective exploration, and proposes entropy-aware step consolidation with a multi-step group-normalized advantage to improve learning efficiency. BranchGRPO [28] reorganizes the rollout process into a branching tree structure, where shared prefixes reduce computational overhead and pruning eliminates low-reward paths and redundant depths. There are some prior arts [29, 30, 31] working on forward-process policy optimization. 
2.3 Autoregressive Visual Generation To circumvent the limitation of bidirectional diffusion models, autoregressive (AR) approaches"},{"citing_arxiv_id":"2604.23380","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think","primary_cat":"cs.LG","submitted_at":"2026-04-25T17:03:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"mization to rollout transition kernels creates a tight coupling between the two stages, limiting implementation flexibility. These limitations invite increasingly elaborate designs to patch the inefficiency and inflexibility of the MDP framework. For instance, MixGRPO [19] introduces a hybrid ODE-SDE sampling scheme with a sliding-window schedule, while BranchGRPO [21] restructures sampling into a branching tree. Both yield notable improvements, but at the cost of substantially higher algorithmic complexity and more hyperparameters. 
A simpler yet often overlooked approach is to revisit the variational roots of diffusion models: adopting pretraining objectives closely connected to the diffusion evidence lower bound (ELBO) as tractable surrogates for the model"},{"citing_arxiv_id":"2604.19009","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-21T02:57:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Building on this, FlowGRPO [27] and DanceGRPO [56] extend GRPO-style updates to flow-matching models by converting deterministic ordinary differential equation (ODE) sampling into stochastic differential equation (SDE) formulations. This modification introduces exploratory noise that facilitates group-wise policy optimization. Subsequent works [11,12,21,22,51] have further refined GRPO-based frameworks to enhance both training efficiency and stability. Despite these advances, recent studies [23,55,59] have identified limitations inherent in policy optimization methods that rely on likelihood estimation, such as systematic bias and restricted solver flexibility. 
In response, DiffusionNFT [59] proposes a novel approach that integrates reinforcement signals directly"},{"citing_arxiv_id":"2604.15311","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"flow matching models, Adjoint Matching [5] formulates reward fine-tuning as stochastic optimal control, whereas DiffusionNFT [64] and AWM [54] propose forward-process RL methods. DanceGRPO [55] and Flow-GRPO [29] adapt GRPO [42] to flow matching by converting deterministic ODE sampling into an equivalent SDE formulation and applying the GRPO loss across generation steps. MixGRPO [22] and other GRPO variants [24, 47, 66] further improve efficiency and performance. Unlike the methods above, direct-gradient methods use the differentiability of diffusion and flow matching samplers to propagate reward gradients directly [3, 35, 43, 44, 52, 53, 62]. 
ReFL [53] randomly selects a timestep near the end of the generation trajectory and uses a one-step leap prediction to estimate the final"},{"citing_arxiv_id":"2604.06916","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling","primary_cat":"cs.LG","submitted_at":"2026-04-08T10:14:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023. [37] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858-79885, 2023. [38] Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025. [39] Zheng Ding and Weirui Ye. Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models, 2025. 
URL https://arxiv.org/abs/2512."},{"citing_arxiv_id":"2605.02913","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-08T00:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04142","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models","primary_cat":"cs.CV","submitted_at":"2026-04-05T15:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"proaches based on Group Relative Policy Optimization (GRPO) [22,44] improve arXiv:2604.04142v1 [cs.CV] 5 Apr 2026 2 L. Zhang et al. aesthetic quality, semantic consistency, and text rendering by optimizing group-normalized advantages over diffusion trajectories [6,12,42]. However, despite their alignment effectiveness, GRPO-based methods exhibit substantial inefficiency in large-scale flow-matching training [17]. Existing approaches such as Flow-GRPO often require thousands of GPU hours to converge, substantially limiting scalability. This inefficiency stems primarily from two factors. 
First, GRPO follows a strictly on-policy paradigm: it repeatedly samples fresh trajectories under the current policy and discards them at the end of each policy iteration."}],"limit":50,"offset":0}