Flow matching policy gradients

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa · 2025 · arXiv 2507.21053

Mixed citation behavior. Most common role is background (50%).

22 Pith papers citing it

citation-role summary

background: 3 · method: 2 · baseline: 1

citation-polarity summary

background: 3 · use method: 2 · baseline: 1

representative citing papers

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

cs.RO · 2026-05-28 · unverdicted · novelty 7.0

CGPO integrates training-free critic guidance into diffusion denoising to produce high-Q actions as regression targets, yielding SOTA results on MuJoCo locomotion and successful Franka arm grasping.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

Generative Actor-Critic with Soft Bridge Policies

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.

Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience

cs.RO · 2026-06-25 · unverdicted · novelty 6.0

SCORE constrains sim RL to the support of a real-data policy via flow steering, raising average success on eight dexterous tasks from 37.8% to 89.9%.

Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

cs.RO · 2026-06-14 · unverdicted · novelty 6.0

A cross-embodiment force-position interface with system-identified torque calibration enables a flow-matching policy to perform transferable compliant grasping on heterogeneous dexterous hands.

DiPOD: Diffusion Policy Optimization without Drifting Apart

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with gradient updates via an on-policy ELBO regularizer, yielding more stable training and higher rewards than prior methods.

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

GenPO++ achieves exact Jacobian-free likelihood ratio computation for generative flow policies by embedding history states as auxiliary memory in a high-order reversible ODE solver.

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.

Adversarial Dual On-Policy Distillation from Expressive Teacher

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

FA-OPD co-trains a flow-matching teacher and MLP student via adversarial dual on-policy distillation, improving robustness over baselines on six robot benchmarks with noisy or limited demonstrations.

DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

cs.RO · 2026-05-19 · unverdicted · novelty 6.0

DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen reference policy.

Video Models Can Reason with Verifiable Rewards

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

Unified Noise Steering for Efficient Human-Guided VLA Adaptation

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

cs.LG · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

Positive-Only Drifting Policy Optimization

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

PODPO is a likelihood-free generative policy optimization method for online RL that steers actions to high-return regions using only positive-advantage samples and local contrastive drifting.

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

Driving Intents Amplify Planning-Oriented Reinforcement Learning

cs.RO · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

cs.RO · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.

ReFPO: Reflow Regularization for Flow Matching Policy Gradients

cs.RO · 2026-06-19 · unverdicted · novelty 4.0

ReFPO adds explicit Reflow regularization to FPO, stabilizing PPO-style training and supporting high-fidelity one-step inference across GridWorld, MuJoCo, and Humanoid tasks.

citing papers explorer

Showing 22 of 22 citing papers after filters.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning · cs.LG · 2026-06-09 · unverdicted · none · ref 43
Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance · cs.RO · 2026-05-28 · unverdicted · none · ref 7
Explicit Critic Guidance for Aligning Diffusion Models · cs.LG · 2026-05-26 · unverdicted · none · ref 51
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL · cs.LG · 2026-05-17 · unverdicted · none · ref 50
Generative Actor-Critic with Soft Bridge Policies · cs.LG · 2026-05-09 · unverdicted · none · ref 12
Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience · cs.RO · 2026-06-25 · unverdicted · none · ref 9
Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands · cs.RO · 2026-06-14 · unverdicted · none · ref 9
DiPOD: Diffusion Policy Optimization without Drifting Apart · cs.LG · 2026-06-11 · unverdicted · none · ref 12
GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios · cs.LG · 2026-06-05 · unverdicted · none · ref 29
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models · cs.LG · 2026-05-28 · unverdicted · none · ref 29
Adversarial Dual On-Policy Distillation from Expressive Teacher · cs.LG · 2026-05-26 · unverdicted · none · ref 10
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies · cs.RO · 2026-05-19 · unverdicted · none · ref 17
Video Models Can Reason with Verifiable Rewards · cs.CV · 2026-05-14 · unverdicted · none · ref 27
Discrete Flow Matching for Offline-to-Online Reinforcement Learning · cs.LG · 2026-05-12 · unverdicted · none · ref 8
Unified Noise Steering for Efficient Human-Guided VLA Adaptation · cs.RO · 2026-05-11 · unverdicted · none · ref 48
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies · cs.LG · 2026-05-04 · unverdicted · none · ref 28 · 2 links
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think · cs.LG · 2026-04-25 · unverdicted · none · ref 27
Positive-Only Drifting Policy Optimization · cs.LG · 2026-04-15 · unverdicted · none · ref 3
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling · cs.LG · 2026-04-08 · unverdicted · none · ref 44
Driving Intents Amplify Planning-Oriented Reinforcement Learning · cs.RO · 2026-05-12 · unverdicted · none · ref 14 · 2 links
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT · cs.RO · 2026-05-09 · unverdicted · none · ref 29 · 2 links
ReFPO: Reflow Regularization for Flow Matching Policy Gradients · cs.RO · 2026-06-19 · unverdicted · none · ref 24