Flow-DPPO replaces PPO ratio clipping with an asymmetric KL divergence mask for flow models, claiming higher rewards, reduced forgetting, and stable multi-epoch training.
Mixed citations
Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952
Mixed citation behavior. Most common role is background (57%).
representative citing papers
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.
DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
Pave-GRPO reformulates GRPO via principled average velocity decomposition to enable denser temporal supervision in flow-based generative model alignment without increasing rollout cost.
RTDMD unifies KL minimization to a reward-tilted teacher into distribution matching plus reward terms, using AC-DMD in stage one and hybrid GRPO-style gradients plus SubGRPO in stage two to reach new SOTA on preference, aesthetic, and compositional metrics with 4-step generation on SD3, SD3.5, and F
Flash-GRPO is a one-step GRPO framework for video diffusion alignment that applies iso-temporal grouping and temporal gradient rectification to achieve higher alignment quality and stability than full-trajectory training under low compute budgets on 1.3B-14B models.
RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
AdaGRPO enhances GRPO for flow models via online curriculum filtering of prompts and cross-level advantage fusion, yielding performance gains and training stability.
Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.
citing papers explorer
- Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition
  Pave-GRPO reformulates GRPO via principled average velocity decomposition to enable denser temporal supervision in flow-based generative model alignment without increasing rollout cost.
- Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
  RTDMD unifies KL minimization to a reward-tilted teacher into distribution matching plus reward terms, using AC-DMD in stage one and hybrid GRPO-style gradients plus SubGRPO in stage two to reach new SOTA on preference, aesthetic, and compositional metrics with 4-step generation on SD3, SD3.5, and F
- Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
  Flash-GRPO is a one-step GRPO framework for video diffusion alignment that applies iso-temporal grouping and temporal gradient rectification to achieve higher alignment quality and stability than full-trajectory training under low compute budgets on 1.3B-14B models.
- RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
  RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.
- When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
  Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
- AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO
  AdaGRPO enhances GRPO for flow models via online curriculum filtering of prompts and cross-level advantage fusion, yielding performance gains and training stability.
- Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
  Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
- A Systematic Post-Train Framework for Video Generation
  A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
- Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization
  GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.