hub Mixed citations

Flow matching policy gradients

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa · 2025 · arXiv 2507.21053

Mixed citation behavior. Most common role is background (50%).

14 Pith papers citing it

Background 50% of classified citations

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 method 2 baseline 1

citation-polarity summary

background 3 use method 2 baseline 1

representative citing papers

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

cs.RO · 2026-05-28 · unverdicted · novelty 7.0

CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

Generative Actor-Critic with Soft Bridge Policies

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.

DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen reference policy.

Video Models Can Reason with Verifiable Rewards

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

Unified Noise Steering for Efficient Human-Guided VLA Adaptation

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

Positive-Only Drifting Policy Optimization

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

PODPO is a likelihood-free generative policy optimization method for online RL that steers actions to high-return regions using only positive-advantage samples and local contrastive drifting.

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

Driving Intents Amplify Planning-Oriented Reinforcement Learning

cs.RO · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

cs.RO · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

cs.LG · 2026-05-04

citing papers explorer

Showing 14 of 14 citing papers.

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance cs.RO · 2026-05-28 · unverdicted · none · ref 7
CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL cs.LG · 2026-05-17 · unverdicted · none · ref 50
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
Generative Actor-Critic with Soft Bridge Policies cs.LG · 2026-05-09 · unverdicted · none · ref 12
SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models cs.LG · 2026-05-28 · unverdicted · none · ref 29
GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies cs.RO · 2026-05-19 · unverdicted · none · ref 17
DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen reference policy.
Video Models Can Reason with Verifiable Rewards cs.CV · 2026-05-14 · unverdicted · none · ref 27
VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.
Discrete Flow Matching for Offline-to-Online Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 8
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
Unified Noise Steering for Efficient Human-Guided VLA Adaptation cs.RO · 2026-05-11 · unverdicted · none · ref 48
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think cs.LG · 2026-04-25 · unverdicted · none · ref 27
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
Positive-Only Drifting Policy Optimization cs.LG · 2026-04-15 · unverdicted · none · ref 3
PODPO is a likelihood-free generative policy optimization method for online RL that steers actions to high-return regions using only positive-advantage samples and local contrastive drifting.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling cs.LG · 2026-04-08 · unverdicted · none · ref 44
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
Driving Intents Amplify Planning-Oriented Reinforcement Learning cs.RO · 2026-05-12 · unverdicted · none · ref 14 · 2 links
DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT cs.RO · 2026-05-09 · unverdicted · none · ref 29 · 2 links
ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies cs.LG · 2026-05-04 · unreviewed · ref 161

Flow matching policy gradients

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer