hub Mixed citations

Coefficients-preserving sampling for reinforcement learning with flow matching.arXiv preprint arXiv:2509.05952

Feng Wang, Zihao Yu · 2025 · arXiv 2509.05952

Mixed citation behavior. Most common role is background (57%).

14 Pith papers citing it

Background 57% of classified citations

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 method 1

citation-polarity summary

background 4 unclear 2 use method 1

representative citing papers

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

Flow-DPPO replaces PPO ratio clipping with an asymmetric KL divergence mask for flow models, claiming higher rewards, reduced forgetting, and stable multi-epoch training.

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

cs.RO · 2026-04-13 · unverdicted · novelty 7.0

ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

cs.AI · 2025-07-29 · unverdicted · novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

TempAct applies hierarchical planner-executor RL with group exploration and multi-level rewards to improve temporal consistency in autoregressive video models.

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

RTDMD unifies KL minimization to a reward-tilted teacher into distribution matching plus reward terms, using AC-DMD in stage one and hybrid GRPO-style gradients plus SubGRPO in stage two to reach new SOTA on preference, aesthetic, and compositional metrics with 4-step generation on SD3, SD3.5, and F

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

cs.LG · 2026-05-22 · unverdicted · novelty 5.0

Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.

Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

cs.CV · 2026-05-08 · unverdicted · novelty 5.0

Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.

A Systematic Post-Train Framework for Video Generation

cs.CV · 2026-04-28 · unverdicted · novelty 5.0

A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

cs.CV · 2025-10-24 · unverdicted · novelty 5.0

GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

cs.CV · 2026-05-15

citing papers explorer

Showing 14 of 14 citing papers.

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models cs.LG · 2026-06-09 · unverdicted · none · ref 9
Flow-DPPO replaces PPO ratio clipping with an asymmetric KL divergence mask for flow models, claiming higher rewards, reduced forgetting, and stable multi-epoch training.
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment cs.LG · 2026-05-09 · unverdicted · none · ref 48 · 2 links
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching cs.RO · 2026-04-13 · unverdicted · none · ref 22
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on locomotion and manipulation benchmarks.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process cs.LG · 2025-09-19 · unverdicted · none · ref 21
DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE cs.AI · 2025-07-29 · unverdicted · none · ref 40
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL cs.CV · 2026-06-26 · unverdicted · none · ref 21
TempAct applies hierarchical planner-executor RL with group exploration and multi-level rewards to improve temporal consistency in autoregressive video models.
Reinforcing Few-step Generators via Reward-Tilted Distribution Matching cs.CV · 2026-05-25 · unverdicted · none · ref 71
RTDMD unifies KL minimization to a reward-tilted teacher into distribution matching plus reward terms, using AC-DMD in stage one and hybrid GRPO-style gradients plus SubGRPO in stage two to reach new SOTA on preference, aesthetic, and compositional metrics with 4-step generation on SD3, SD3.5, and F
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy cs.CV · 2026-05-12 · unverdicted · none · ref 60
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think cs.LG · 2026-04-25 · unverdicted · none · ref 39
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models cs.LG · 2026-05-22 · unverdicted · none · ref 15
Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers cs.CV · 2026-05-08 · unverdicted · none · ref 41
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
A Systematic Post-Train Framework for Video Generation cs.CV · 2026-04-28 · unverdicted · none · ref 24
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization cs.CV · 2025-10-24 · unverdicted · none · ref 14
GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.
Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization cs.CV · 2026-05-15 · unreviewed · ref 27

Coefficients-preserving sampling for reinforcement learning with flow matching.arXiv preprint arXiv:2509.05952

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer